Let’s start by loading packages and reading the data:
library(ggplot2)
library(dplyr)
flights <- read.csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-07-12/flights.csv')
head(flights)
## YEAR MONTH_NUM MONTH_MON FLT_DATE APT_ICAO APT_NAME
## 1 2016 1 JAN 2016-01-01T00:00:00Z EBAW Antwerp
## 2 2016 1 JAN 2016-01-01T00:00:00Z EBBR Brussels
## 3 2016 1 JAN 2016-01-01T00:00:00Z EBCI Charleroi
## 4 2016 1 JAN 2016-01-01T00:00:00Z EBLG Liège
## 5 2016 1 JAN 2016-01-01T00:00:00Z EBOS Ostend-Bruges
## 6 2016 1 JAN 2016-01-01T00:00:00Z EDDB Berlin - Brandenburg
## STATE_NAME FLT_DEP_1 FLT_ARR_1 FLT_TOT_1 FLT_DEP_IFR_2 FLT_ARR_IFR_2
## 1 Belgium 4 3 7 NA NA
## 2 Belgium 174 171 345 174 161
## 3 Belgium 45 47 92 45 45
## 4 Belgium 6 7 13 NA NA
## 5 Belgium 7 7 14 NA NA
## 6 Germany 98 99 197 NA NA
## FLT_TOT_IFR_2 Pivot.Label
## 1 NA Antwerp (EBAW)
## 2 335 Brussels (EBBR)
## 3 90 Charleroi (EBCI)
## 4 NA Liège (EBLG)
## 5 NA Ostend-Bruges (EBOS)
## 6 NA Berlin - Brandenburg (EDDB)
names(flights)
## [1] "YEAR" "MONTH_NUM" "MONTH_MON" "FLT_DATE"
## [5] "APT_ICAO" "APT_NAME" "STATE_NAME" "FLT_DEP_1"
## [9] "FLT_ARR_1" "FLT_TOT_1" "FLT_DEP_IFR_2" "FLT_ARR_IFR_2"
## [13] "FLT_TOT_IFR_2" "Pivot.Label"
Let’s aggregate up to the country-month level.
flights %>%
group_by(STATE_NAME,YEAR,MONTH_NUM) %>%
dplyr::summarise(total_flights = sum(FLT_TOT_1,na.rm=T)) -> agg
## `summarise()` regrouping output by 'STATE_NAME', 'YEAR' (override with `.groups` argument)
head(agg)
## # A tibble: 6 x 4
## # Groups: STATE_NAME, YEAR [1]
## STATE_NAME YEAR MONTH_NUM total_flights
## <fct> <int> <int> <int>
## 1 Albania 2016 1 1666
## 2 Albania 2016 2 1507
## 3 Albania 2016 3 1676
## 4 Albania 2016 4 1680
## 5 Albania 2016 5 1710
## 6 Albania 2016 6 1720
Get our date variable set up…
date_base <- paste0(agg$YEAR,"-",agg$MONTH_NUM,"-01")
agg$date <- as.Date(date_base,format = "%Y-%m-%d")
And plot the total flights over time:
agg %>%
ggplot(aes(x=date, y= log(total_flights), col = as.factor(STATE_NAME))) +
geom_point() +
geom_line() +
theme_bw() +
labs(col = "Year") +
ylab("Log of Total Flights") +
xlab("Date") +
theme(legend.position = "none") +
scale_x_date(date_breaks = "1 year", date_labels = "%Y") +
ggtitle("Log of Total Flights by Country Over Time")
Something worth note is how, when placed on the log scale, the flights across country-months show remarkable parallel trends on average (if you squint you can see some with more severe cycles than others), both before and after the pandemic. This implies that countries have roughly stable “shares” of air transport, although this regime partially broke down when the pandemic rolled through and through the early period of the recovery (much more mixing in 2021). We also see the onset of the war in Ukraine (the one observation which drops to zero).
This can be visualized by taking a look at the total flight rankings over time, specifically the rank-variability post-2020.
agg$post_2020 <- agg$date >= "2020-01-01"
agg %>%
arrange(total_flights) %>%
group_by(YEAR,MONTH_NUM) %>%
dplyr::mutate(rank = row_number()) -> agg
agg %>%
group_by(STATE_NAME,post_2020) %>%
dplyr::mutate(vrank = var(rank,na.rm=T)) -> agg
summary(agg)
## STATE_NAME YEAR MONTH_NUM total_flights
## Albania : 77 Min. :2016 Min. : 1.00 Min. : 1
## Armenia : 77 1st Qu.:2017 1st Qu.: 3.00 1st Qu.: 2816
## Austria : 77 Median :2019 Median : 6.00 Median : 7688
## Belgium : 77 Mean :2019 Mean : 6.27 Mean : 27342
## Bosnia and Herzegovina: 77 3rd Qu.:2020 3rd Qu.: 9.00 3rd Qu.: 30024
## Bulgaria : 77 Max. :2022 Max. :12.00 Max. :215884
## (Other) :2722
## date post_2020 rank vrank
## Min. :2016-01-01 Mode :logical Min. : 1.00 Min. : 0.0000
## 1st Qu.:2017-08-01 FALSE:1968 1st Qu.:11.00 1st Qu.: 0.7373
## Median :2019-04-01 TRUE :1216 Median :21.00 Median : 1.2301
## Mean :2019-03-13 Mean :21.19 Mean : 2.6713
## 3rd Qu.:2020-10-08 3rd Qu.:32.00 3rd Qu.: 2.8613
## Max. :2022-05-01 Max. :42.00 Max. :31.9236
##
high_var <- unique(agg[which(agg$vrank >= quantile(agg$vrank,0.9) & agg$post_2020 == T),"STATE_NAME"])
agg[,"high_var"] <- F
agg[which(agg$STATE_NAME %in% unlist(high_var)),"high_var"] <- T
Here is the messy pre-subsetting plot:
agg %>%
ggplot(aes(x=date,y=rank,col=STATE_NAME)) +
geom_point() +
geom_line() +
theme_bw() +
theme(legend.position = "none") +
ylab("Rank Order of Total Flights") +
xlab("Date") +
scale_x_date(date_breaks = "1 year", date_labels = "%Y") +
ggtitle("Flight Total Rank Orders")
That’s quite cluttered, but we can see a cluster of rank changes post 2020. Here is a little bit more focused view:
agg %>%
filter(high_var == T) %>%
ggplot(aes(x=date,y=rank,col=STATE_NAME)) +
geom_point() +
geom_line() +
theme_bw() +
ylab("Rank Order of Total Flights") +
xlab("Date") +
scale_x_date(date_breaks = "1 year", date_labels = "%Y") +
labs(col = "State") +
ggtitle("10% Most Variable Flight Total Rank Orders Post 2020")
What do we learn? Prior to 2020, many countries had either highly cyclical rankings or near constant rankings. After 2020 and through the recovery, the rankings of a number of states became highly variable, states like Bulgaria increasing their overall rank in the industry while states like Slovania suffered. Other states, like Greece and Cyprus, were less impacted in this way, their rankings seeming do depend on tourist schedules rather than a definite post-covid regime change.