Part I – Introduction

In the US, any commercial carrier providing services to/from the US and International points, has a mandate by the Department of Transportation to report on certain metrics. These information is collected and disseminated by the Office of Aviation Analysis in DC.

Airlines need to reoport various figures - from ontime performance, financial data, fare/ticket prices, to other industry standard metrics. Because the pandemic has been really damaging to the industry. I opted to briefly examine what the tendencies and or trends look like for the industry by looking at passenger and flight data since 1990. I initially sought the information on Kagle, but was also able to find it readily accessible in the department of Transportation. Much of the time, I have spent it with data preparation - from loading files, to updating them, and joining some of the data.

The following presents the details of the project, and provide some insights on the traffic in commercial air transport.

Part II – The Dataset

The data sets I obtained contained about 1.5 million records with more than 20 categorical, and numerical variables. The files were manipulated with R, and data was validated within the R studio environment as well as excel. (given the large amount of records, I chose to validate the information carefully)

Data Upload and Wrangling

Before I start manipulating the data I will load tidyverse in case I need to use dplyr or some other tidyverse package.

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3     ✓ purrr   0.3.4
## ✓ tibble  3.1.0     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Here I upload the data, and create a data frame called T100_Pax containing the number of passengers transported between the US & International destionations by any airline operating any given route, + other relevant variables.

T100_Pax <- read.csv("/Users/cruz-diazgroup/Desktop/1. William/1. School/DATA SCIENCE/2. DATA 110 Vis. & Com./3. Projects/US_Intl_AirTraffic_1990_2020/International_Report_Passengers.csv", check.names = FALSE, header = TRUE, sep = ",") 

Here I upload the data with the information flights operated between US points and International destionations by any airline

T100_Depts <- read.csv("/Users/cruz-diazgroup/Desktop/1. William/1. School/DATA SCIENCE/2. DATA 110 Vis. & Com./3. Projects/US_Intl_AirTraffic_1990_2020/International_Report_Departures.csv", check.names = FALSE, header = TRUE, sep = ",") 

In this step I am loading World Area Codes (WAC codes) associated with airport codes to “decode” and obtain proper location names for the Int’l destinations. I sourced this file from the Bureau of Transportation Statistics - Aviation Support Tables: (https://www.transtats.bts.gov/Tables.asp?QO_VQ=IMI&QO_anzr=N8vn6v10%FDf722146%FDgnoyr5&QO_fu146_anzr=N8vn6v10%FDf722146%FDgnoyr5)

WAC_codes <- read.csv("/Users/cruz-diazgroup/Desktop/1. William/1. School/DATA SCIENCE/2. DATA 110 Vis. & Com./3. Projects/US_Intl_AirTraffic_1990_2020/807633242_T_WAC_COUNTRY_STATE.csv", check.names = FALSE, header = TRUE, sep = ",") 

Unfortunately I had to source a forth file in order to obtain the names of all the airline codes listed in the data. The list of codes was incomplete for the main source files, therefore I had to resort to getting it through a different source - wikipedia. This was the most complete list after exhausting other alternatives (wikipedia https://en.wikipedia.org/wiki/List_of_airline_codes). The data had to be copied and pasted into a csv file and then imported as follows

Airline_decoder <- read.csv("/Users/cruz-diazgroup/Desktop/1. William/1. School/DATA SCIENCE/2. DATA 110 Vis. & Com./3. Projects/US_Intl_AirTraffic_1990_2020/airline_decoder.csv", check.names = FALSE, header = TRUE, sep = ",") 

I am now going to join the T100_Depts and T100_Pax with a left join in order to get departures and passengers in one data frame. (Once my data is validated and also summarised I will be able to use the 2 decoders above)

DF_Dpts_Pax <-left_join(T100_Depts, T100_Pax, by = c("data_dte", "Year", "Month", "usg_apt_id", "usg_apt", "usg_wac", "fg_apt_id", "fg_apt", "fg_wac", "airlineid", "carrier"))

Because the datasets are super large, I need to compare some of the figures/totals against the original files (and also with another data tool i.e. excel pivot) to ensure that I did not create any duplicates in the process.

So in the next 6 chunks I am checking the original source files with the newly created data frame “DF_Dpts_Pax” to review the integrity of my data and figures.

T100_Depts %>%
  group_by(Year) %>%
  summarise(sum(Total, na.rm = TRUE))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 31 x 2
##     Year `sum(Total, na.rm = TRUE)`
##    <int>                      <int>
##  1  1990                     683399
##  2  1991                     701843
##  3  1992                     784967
##  4  1993                     820376
##  5  1994                     844760
##  6  1995                     913350
##  7  1996                     943039
##  8  1997                     983723
##  9  1998                    1082878
## 10  1999                    1116846
## # … with 21 more rows
DF_Dpts_Pax %>%
  select(Year, Total.x) %>%
  group_by(Year) %>%
  summarise(sum(Total.x, na.rm = TRUE))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 31 x 2
##     Year `sum(Total.x, na.rm = TRUE)`
##    <int>                        <int>
##  1  1990                       683399
##  2  1991                       701843
##  3  1992                       784967
##  4  1993                       820376
##  5  1994                       844760
##  6  1995                       913350
##  7  1996                       943039
##  8  1997                       983723
##  9  1998                      1082878
## 10  1999                      1116846
## # … with 21 more rows

I also took the time to review the data integrity in excel so that I could compare the files against my work in R to ensure no duplicity of records or mistakes in figures when doing the data wrangling.

library(png)
library(grid)
Flts_img <- readPNG("/Users/cruz-diazgroup/Desktop/1. William/1. School/DATA SCIENCE/2. DATA 110 Vis. & Com./3. Projects/US_Intl_AirTraffic_1990_2020/Flts_validation.png")
grid.raster(Flts_img)

T100_Pax %>%
  group_by(Year) %>%
  summarise(sum(Total, na.rm = TRUE))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 31 x 2
##     Year `sum(Total, na.rm = TRUE)`
##    <int>                      <int>
##  1  1990                   84425428
##  2  1991                   81771514
##  3  1992                   89642833
##  4  1993                   94140075
##  5  1994                   98550584
##  6  1995                  105521802
##  7  1996                  113248651
##  8  1997                  120323779
##  9  1998                  125299971
## 10  1999                  133208907
## # … with 21 more rows
DF_Dpts_Pax %>%
  select(Year, Total.y) %>%
  group_by(Year) %>%
  summarise(sum(Total.y, na.rm = TRUE))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 31 x 2
##     Year `sum(Total.y, na.rm = TRUE)`
##    <int>                        <int>
##  1  1990                     84425428
##  2  1991                     81771514
##  3  1992                     89642833
##  4  1993                     94140074
##  5  1994                     98550584
##  6  1995                    105521802
##  7  1996                    113248537
##  8  1997                    120323779
##  9  1998                    125299971
## 10  1999                    133208907
## # … with 21 more rows

Again - embedding an image of the data summarized in excel to compare against the data worked in R.

library(png)
library(grid)
pax_img <- readPNG("/Users/cruz-diazgroup/Desktop/1. William/1. School/DATA SCIENCE/2. DATA 110 Vis. & Com./3. Projects/US_Intl_AirTraffic_1990_2020/Pax_validation.png")
grid.raster(pax_img)

As a last step, I will also rename 2 important variables so that I do not get them confused: “Total.x” and “Total.y”, which represent the Total number of flights, and Total number of passengers respectively.

DF_Dpts_Pax <- DF_Dpts_Pax %>%
  rename(
    Total_Flights = Total.x, 
    Total_Passengers = Total.y
  )

Part III – Insights

International travel to/from US overtime:

The following graph has been produced to illustrate the effects of “World” events on international travel to/from the US. Although the flights/passenger figures for 2020 are not complete (only Q1 data has been made available due to Airlines-Government confidentiality - data lag) other years can provide some insights to the effects of “World Casualties” impacting the travel industry.

For instance since the data has been made available in 1990, there have been at least 3 major events that have impacted commercial aviation negatively:

  • 1990-1991: Gulf War lasting from August 1990 - February 1991
  • 2001: September 11 terrorist attacks in the US
  • 2007-2009: The World recession lasting from December 2007 - June 2009.

Fortunately, the data also shows that despite negative world events commercial air transport has been resilient it has recovered as it can be observed from its exponential growth in the last decade. The question for post-covid travel is not “if,” but when travel will rebound.

To contruct the graph I first start by aggregating my dataset and segmenting a couple of variables to show at a “high level” what’s happened to International travel over the last 3 decades.

LastDecs <- aggregate(
  x = DF_Dpts_Pax[c("Total_Flights", "Total_Passengers")], 
  by = DF_Dpts_Pax[c("Year")], 
  FUN = sum, na.rm = TRUE
)

I then create a graph using 2 type of geoms - bar and line.

LastDecs %>%
  ggplot() +
  geom_bar(mapping = aes(x = Year, y = Total_Passengers), stat = "identity", fill = "light blue") +
  geom_line(mapping = aes(x = Year, y = Total_Flights*100), size = .5, color = "blue") + 
  scale_y_continuous(name = "Int'l Passengers (Totals / Year)", 
    sec.axis = sec_axis(~./100, name = "Int'l Flights (x100)", 
      labels = function(b) { paste0(round(b * 100, 0), "")})) + 
  theme(
      axis.title.y = element_text(color = "grey"),
      axis.title.y.right = element_text(color = "grey")) +
    labs(
    title = "Total Travel between US and Int'l Destinations 1990 - 2020"
  ) 

In addition to an aggregated data set, I wanted to examine whether there had been a shift in “destinations” or regions of the World over time. Thus I created another data subset to examine if in fact there has been a major change in air travel accross international points.
As it is seen below, much of the top nations for international travel remain at the top. However in some regions new destinations are really booming. In Europe for instance, Iceland and Portugal with have started to show up on the map with significant increase in traffic in the last few years, and in the Far East China has really taken off since 2010. Perhaps the increase in traffic could be attributed to improved GDP, I will have to investigate in the future to make that assertion.

In the chunk below, I went ahead and aggregated data before joining it with foreign location variables that would point out where traffic between the US and International points has shifted.

Intl_Dests <- aggregate(
  x = DF_Dpts_Pax[c("Total_Flights", "Total_Passengers")], 
  by = DF_Dpts_Pax[c("Year", "fg_wac")], 
  FUN = sum, na.rm = TRUE
)

Having aggregated the data, I now can cross reference the foreign location variable “fg_wac, with the WAC_codes data frame. I will store a left join data reference under”Intl_Dests" data frame.

Intl_Dests <- left_join(Intl_Dests, WAC_codes, by = c("fg_wac" = "WAC"))

In the below treemaps I have segmented international destinations for which passenger travel exceeded 250,000 on a given year for every 5 years since 1990. The countries are listed by their correspoinding world regions, and the treemap shade goes from darker to lighter as more flights are within a specific country/region

library(ggplot2)
library(treemapify)
library(ggfittext)
T1990 <- Intl_Dests %>%
  filter(Year == 1990, Total_Passengers > 250000) %>%
  ggplot(aes(area = Total_Passengers, fill = Total_Passengers, label = COUNTRY_SHORT_NAME, 
                       subgroup = WORLD_AREA_NAME)) +
  geom_treemap() +
  geom_treemap_subgroup_border() +
  geom_treemap_subgroup_text(place = "centre", grow = T, alpha = 0.5, colour = "gray", 
                    frontface = "italic", min.size = 0) +
  geom_treemap_text(colour = "white", place = "topleft", reflow = T) +
  labs(
    title = " Year 1990 ",
    fill = "Passenger totals"
  )
## Warning: Ignoring unknown parameters: frontface
T1995 <-Intl_Dests %>%
  filter(Year == 1995, Total_Passengers > 250000) %>%
  ggplot(aes(area = Total_Passengers, fill = Total_Passengers, label = COUNTRY_SHORT_NAME, 
                       subgroup = WORLD_AREA_NAME)) +
  geom_treemap() +
  geom_treemap_subgroup_border() +
  geom_treemap_subgroup_text(place = "centre", grow = T, alpha = 0.5, colour = "gray", 
                    frontface = "italic", min.size = 0) +
  geom_treemap_text(colour = "white", place = "topleft", reflow = T) +
  labs(
    title = "Year 1995", 
    fill = "Passenger totals"
  ) 
## Warning: Ignoring unknown parameters: frontface
T2000 <- Intl_Dests %>%
  filter(Year == 2000, Total_Passengers > 250000) %>%
  ggplot(aes(area = Total_Passengers, fill = Total_Passengers, label = COUNTRY_SHORT_NAME, 
                       subgroup = WORLD_AREA_NAME)) +
  geom_treemap() +
  geom_treemap_subgroup_border() +
  geom_treemap_subgroup_text(place = "centre", grow = T, alpha = 0.5, colour = "gray", 
                    frontface = "italic", min.size = 0) +
  geom_treemap_text(colour = "white", place = "topleft", reflow = T) +
  labs(
    title = "Year 2000",
    fill = "Passenger totals"
  ) 
## Warning: Ignoring unknown parameters: frontface
T2005 <- Intl_Dests %>%
  filter(Year == 2005, Total_Passengers > 250000) %>%
  ggplot(aes(area = Total_Passengers, fill = Total_Passengers, label = COUNTRY_SHORT_NAME, 
                       subgroup = WORLD_AREA_NAME)) +
  geom_treemap() +
  geom_treemap_subgroup_border() +
  geom_treemap_subgroup_text(place = "centre", grow = T, alpha = 0.5, colour = "gray", 
                    frontface = "italic", min.size = 0) +
  geom_treemap_text(colour = "white", place = "topleft", reflow = T) +
  labs(
    title = "Year 2005",
    fill = "Passenger totals"
  ) 
## Warning: Ignoring unknown parameters: frontface
T2010 <- Intl_Dests %>%
  filter(Year == 2010, Total_Passengers > 250000) %>%
  ggplot(aes(area = Total_Passengers, fill = Total_Passengers, label = COUNTRY_SHORT_NAME, 
                       subgroup = WORLD_AREA_NAME)) +
  geom_treemap() +
  geom_treemap_subgroup_border() +
  geom_treemap_subgroup_text(place = "centre", grow = T, alpha = 0.5, colour = "gray", 
                    frontface = "italic", min.size = 0) +
  geom_treemap_text(colour = "white", place = "topleft", reflow = T) +
  labs(
    title = "Year 2010",
    fill = "Passenger totals"
  ) 
## Warning: Ignoring unknown parameters: frontface
T2015 <- Intl_Dests %>%
  filter(Year == 2015, Total_Passengers > 250000) %>%
  ggplot(aes(area = Total_Passengers, fill = Total_Passengers, label = COUNTRY_SHORT_NAME, 
                       subgroup = WORLD_AREA_NAME)) +
  geom_treemap() +
  geom_treemap_subgroup_border() +
  geom_treemap_subgroup_text(place = "centre", grow = T, alpha = 0.5, colour = "gray", 
                    frontface = "italic", min.size = 0) +
  geom_treemap_text(colour = "white", place = "topleft", reflow = T) +
  labs(
    title = "Year 2015",
    fill = "Passenger totals"
  ) 
## Warning: Ignoring unknown parameters: frontface
T2019 <- Intl_Dests %>%
  filter(Year == 2019, Total_Passengers > 250000) %>%
  ggplot(aes(area = Total_Passengers, fill = Total_Passengers, label = COUNTRY_SHORT_NAME, 
                       subgroup = WORLD_AREA_NAME)) +
  geom_treemap() +
  geom_treemap_subgroup_border() +
  geom_treemap_subgroup_text(place = "centre", grow = T, alpha = 0.5, colour = "gray", 
                    frontface = "italic", min.size = 0) +
  geom_treemap_text(colour = "white", place = "topleft", reflow = T) +
  labs(
    title = "Year 2019",
    fill = "Passenger totals"
  ) 
## Warning: Ignoring unknown parameters: frontface
facet_grid(align ~ Year, margins = TRUE)
## <ggproto object: Class FacetGrid, Facet, gg>
##     compute_layout: function
##     draw_back: function
##     draw_front: function
##     draw_labels: function
##     draw_panels: function
##     finish_data: function
##     init_scales: function
##     map_data: function
##     params: list
##     setup_data: function
##     setup_params: function
##     shrink: TRUE
##     train_scales: function
##     vars: function
##     super:  <ggproto object: Class FacetGrid, Facet, gg>
library(patchwork)
(T1990 + T1995 + 
   plot_layout(ncol = 2, nrow = 1) +
   plot_annotation(title = "Passenger Traffic between the US and Int'l Destinations", 
                    subtitle = "Years 1990, 1995", 
                    caption = "https://www.kaggle.com/parulpandey/us-international-air-traffic-data. 
                    The area of each tile represents the size of passenger traffic as a proportion of all countries in the Region"))

(T2000 + T2005 + 
   plot_layout(ncol = 2, nrow = 1) +
   plot_annotation(title = "Passenger Traffic between the US and Int'l Destinations", 
                    subtitle = "Years 2000, 2005", 
                    caption = "https://www.kaggle.com/parulpandey/us-international-air-traffic-data. 
                    The area of each tile represents the size of passenger traffic as a proportion of all countries in the Region"))

(T2010 + T2015 + 
   plot_layout(ncol = 2, nrow = 1) +
   plot_annotation(title = "Passenger Traffic between the US and Int'l Destinations", 
                    subtitle = "Years 2010, 2015", 
                    caption = "https://www.kaggle.com/parulpandey/us-international-air-traffic-data. 
                    The area of each tile represents the size of passenger traffic as a proportion of all countries in the Region"))

(T2019 + 
   plot_layout(ncol = 2, nrow = 1) +
   plot_annotation(title = "Passenger Traffic between the US and Int'l Destinations", 
                    subtitle = "Years 2019", 
                    caption = "https://www.kaggle.com/parulpandey/us-international-air-traffic-data. 
                    The area of each tile represents the size of passenger traffic as a proportion of all countries in the Region"))

The above treemaps were of particular interest to me as I had wanted to present these as “animation - gifs” with the years 1990-2020 changing one treemap. However, it has been impossible to attain - in part because the data set slowed my computer a lot so I spent 60% of my time working on the data wrangling, which is nevertheless a good practice.

I believe the snags I run into are a good motivation to continue to investigagte and work with R until I master it.