I’ll first pull in the tidyverse, which contains both dplyr and tidyr for tidying data sets, as well as additional tools for analysis and visualization downstream.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.1
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.3.0 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
Next I’ll import my data, which I created in Google Sheets–replicating the format from the pdf–and exported as a .csv file.
df <- read.csv('week5_assignment_messy_data.csv')
df
## X X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 delayed 62 12 20 102 305
## 3 NA NA NA NA NA
## 4 AM WEST on time 694 4840 383 320 201
## 5 delayed 117 415 65 129 61
First, I need to add column names to the two columns with empty headers.
colnames(df)[1] <- 'airline'
colnames(df)[2] <- 'status'
df
## airline status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 delayed 62 12 20 102 305
## 3 NA NA NA NA NA
## 4 AM WEST on time 694 4840 383 320 201
## 5 delayed 117 415 65 129 61
Next, I’ll drop the long, fully null row between the two airlines. It is important to note that this function drops rows with any nulls, so I would use caution with a larger file where nulls may appear elsewhere.
df <- drop_na(df)
df
## airline status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 delayed 62 12 20 102 305
## 3 AM WEST on time 694 4840 383 320 201
## 4 delayed 117 415 65 129 61
In the airline column, the airline is only named once and implied to apply down the table (until another value is named). I’ll use the fill function in tidyr to mitigate this.
#make sure blank strings are interpreted as null
df$airline[df$airline == ''] <- NA
#fill
df <- df %>%
fill(airline)
df
## airline status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 ALASKA delayed 62 12 20 102 305
## 3 AM WEST on time 694 4840 383 320 201
## 4 AM WEST delayed 117 415 65 129 61
Because the data is in a ‘wide’ format, I need to pivot it ‘longer’. Isolating all the columns that account for cities, I can put them in one column and assign the name ‘city,’ while each number of flights will fall under ‘count.’
df <- df %>%
pivot_longer(
cols = seq(3,7),
names_to = 'city',
values_to = 'count'
)
df
## # A tibble: 20 × 4
## airline status city count
## <chr> <chr> <chr> <int>
## 1 ALASKA on time Los.Angeles 497
## 2 ALASKA on time Phoenix 221
## 3 ALASKA on time San.Diego 212
## 4 ALASKA on time San.Francisco 503
## 5 ALASKA on time Seattle 1841
## 6 ALASKA delayed Los.Angeles 62
## 7 ALASKA delayed Phoenix 12
## 8 ALASKA delayed San.Diego 20
## 9 ALASKA delayed San.Francisco 102
## 10 ALASKA delayed Seattle 305
## 11 AM WEST on time Los.Angeles 694
## 12 AM WEST on time Phoenix 4840
## 13 AM WEST on time San.Diego 383
## 14 AM WEST on time San.Francisco 320
## 15 AM WEST on time Seattle 201
## 16 AM WEST delayed Los.Angeles 117
## 17 AM WEST delayed Phoenix 415
## 18 AM WEST delayed San.Diego 65
## 19 AM WEST delayed San.Francisco 129
## 20 AM WEST delayed Seattle 61
The city names in the table now have periods instead of spaces–a holdover from when they were column names, which cannot have spaces–I will use gsub to get spaces back into the values.
df$city <- gsub('\\.', ' ', df$city)
df
## # A tibble: 20 × 4
## airline status city count
## <chr> <chr> <chr> <int>
## 1 ALASKA on time Los Angeles 497
## 2 ALASKA on time Phoenix 221
## 3 ALASKA on time San Diego 212
## 4 ALASKA on time San Francisco 503
## 5 ALASKA on time Seattle 1841
## 6 ALASKA delayed Los Angeles 62
## 7 ALASKA delayed Phoenix 12
## 8 ALASKA delayed San Diego 20
## 9 ALASKA delayed San Francisco 102
## 10 ALASKA delayed Seattle 305
## 11 AM WEST on time Los Angeles 694
## 12 AM WEST on time Phoenix 4840
## 13 AM WEST on time San Diego 383
## 14 AM WEST on time San Francisco 320
## 15 AM WEST on time Seattle 201
## 16 AM WEST delayed Los Angeles 117
## 17 AM WEST delayed Phoenix 415
## 18 AM WEST delayed San Diego 65
## 19 AM WEST delayed San Francisco 129
## 20 AM WEST delayed Seattle 61
First, I want to make a plot demonstrating the volume of flights for each city, split by airline. Additionally, I’ll color each bar based on the split of on-time vs. delayed flights.
ggplot(df) +
geom_col(aes(city, count, fill = status)) +
facet_wrap(~airline) +
coord_flip() +
scale_fill_manual('status',
values = c('on time' = "deepskyblue3",
'delayed' = "deepskyblue4"))
From this chart, we can see that Alaska Airlines had many more flights into Seattle than AM West, while AM West was massively skewed toward Phoenix.
Because delayed flights are a relatively small percentage of flights across cities, it can be difficult to tell the frequency of delays across cities visually. To account for this, I can use position = ‘fill’ as an argument of aes within geom_col() to make each bar fill the graph 100%. Regardless of volume, we can now get a sense of the percentage of arrivals into each city were delayed, split by the two airlines.
ggplot(df) +
geom_col(aes(city, count, fill = status),
position = 'fill') +
facet_wrap(~airline) +
coord_flip() +
scale_fill_manual('status',
values = c('on time' = "deepskyblue3",
'delayed' = "deepskyblue4"))
We get several additional learnings from this chart: