This report analyzes U.S. international and domestic air passenger traffic data from 2018 through October 2020. The dataset shows flight and passenger records across regional, major, and international airlines that passed through TSA checkpoints. The goal of this analysis is to use data visualizations to uncover trends in air travel and show the impact that the COVID-19 pandemic had on the aviation industry beginning in early 2020.
The data, secured from https://www.kaggle.com/datasets/dkungu/tsa-passengers, used in this analysis comes from three separate CSV files covering the years 2018, 2019, and 2020. It should be noted that the dataset provided for 2020 only goes up to October of the 2020 calendar year. Each file contains records of passenger counts, origin countries, flight months, and airline information. The three files were combined into a single dataset for analysis. All charts exclude domestic U.S. origin travel in order to focus on the overall impact that the pandemic had on closing the U.S. borders. By keeping domestic U.S. flights and their TSA Passenger counts, there would be risk of intermingling additional flights where someone flies in internationally than immediately hops on another plane to get to a certain location.
The following five charts explore different dimensions of the data, from total flight activity and busiest origin countries to the seasonal patterns of passenger travel and the year over year impact of COVID-19.
This scatter plot visualizes the total number of flights recorded each month across 2018, 2019, and 2020. Each bubble represents a specific month-year combination, with the bubble’s size and color reflecting the volume of flights at that time. The chart shows consistent levels of total flights in the U.S. during 2018 and 2019, followed by a decline in flight activity beginning in April 2020, as the COVID-19 pandemic forced travel shutdowns in the U.S. and abroad. The darker, smaller bubbles in the lower half of the 2020 show the decline becoming extremely apparent out of nowhere.
import os
os.environ['QT_QPA_PLATFORM_PLUGIN_PATH'] = 'C:/Users/rober/miniconda3/Library/plugins/platforms'
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import warnings
warnings.filterwarnings("ignore")
path = "C:/Users/rober/OneDrive/Desktop/Python/"
files = ['total-passengers-2018.csv',
'total-passengers-2019.csv',
'total-passengers-2020.csv']
df_combined = pd.concat([pd.read_csv(path + f, usecols=['year', 'month', 'passengers', 'origin_country']) for f in files], ignore_index=True)
x = df_combined.groupby(['year', 'month'])['year'].count().reset_index(name='total flight occurances')
x = pd.DataFrame(x)
x['total flight occurances_hundreds'] = round(x['total flight occurances']/100, 0)
plt.figure(figsize=(14,6))
plt.scatter(x['month'], x['year'], marker='8', cmap='viridis',
c=x['total flight occurances_hundreds'], s=x['total flight occurances_hundreds'], edgecolors='black',
vmin=x['total flight occurances_hundreds'].min(),
vmax=x['total flight occurances_hundreds'].max())
plt.title('Total Flights By Regional and Major Airlines That Had Passengers Go Through TSA', fontsize=18)
plt.xlabel('Months of the Year', fontsize=14)
plt.ylabel('Year', fontsize=14)
plt.xticks(range(1, 13), ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])
plt.yticks([2018, 2019, 2020])
cbar = plt.colorbar()
cbar.set_label('Total Flight Occurances', rotation=270, fontsize=14, color='black', labelpad=30)
my_colorbar_ticks = [*range(20, int(x['total flight occurances_hundreds'].max()), 20)]
cbar.set_ticks(my_colorbar_ticks)
my_colorbar_tick_labels = [*range(2000, int(x['total flight occurances'].max()), 2000)]
my_colorbar_tick_labels = ['{:,}'.format(each) for each in my_colorbar_tick_labels]
cbar.set_ticklabels(my_colorbar_tick_labels)
plt.text(11, 2017.7, '*2020 data only through October', fontsize=10, color='gray')
plt.show()
This bar chart breaks down which international origin countries contributed the most passengers traveling into the United States between 2018 and October 2020. Taking the United States out of consideration allows for a better picture of which countries we see the largest proportion of travel coming in from. This is not to say that there are not connecting flights that lead to these nations, but there is great information provided here showing what countries we engage the most with. The color-coding system identifies which countries perform above average, near average, and below average in passenger volume, providing a quick visual reference for understanding where the bulk of international air traffic originates and which countries are outliers in either direction.
def pick_colors_according_to_mean_count(this_data):
colors=[]
avg = this_data.passengers.mean()
for each in this_data.passengers:
if each > avg*1.01:
colors.append('lightcoral')
elif each < avg*0.99:
colors.append('green')
else:
colors.append('black')
return colors
x = df_combined.groupby('origin_country')['passengers'].sum().reset_index()
x = pd.DataFrame(x)
x_no_us = x[x['origin_country'] != 'US'].sort_values('passengers', ascending=False).reset_index(drop=True)
bottom1 = 1
top1 = 157
d1 = x_no_us.loc[bottom1:top1]
my_colors1 = pick_colors_according_to_mean_count(d1)
bottom2 = 1
top2 = 10
d2 = x_no_us.loc[bottom2:top2]
my_colors2 = pick_colors_according_to_mean_count(d2)
Above = mpatches.Patch(color='lightcoral', label='Above Average')
At = mpatches.Patch(color='black', label='Within 1% of the Average')
Below = mpatches.Patch(color='green', label='Below Average')
fig = plt.figure(figsize=(40,32))
fig.suptitle('Frequency of Passengers Traveling Analysis by International Origin Country:\n Top ' + str(top1) + ' and Top ' + str(top2) + ' From 2018- Oct. 2020', fontsize=18, fontweight='bold')
ax1 = fig.add_subplot(2, 1, 1)
ax1.bar(d1.origin_country, d1.passengers, label='Passengers', color=my_colors1)
#ax1.legend(fontsize=14)
ax1.legend(handles=[Above, At, Below], fontsize=14)
plt.axhline(d1.passengers.mean(), color='black', linestyle='dashed')
ax1.spines['right'].set_visible(False)
ax1.spines['top'].set_visible(False)
ax1.axes.xaxis.set_visible(False)
ax1.set_title('Top ' + str(top1) + ' Countries', size=20)
ax1.text(top1-10, d1.passengers.mean()+5, 'Mean = ' + str(round(d1.passengers.mean())), rotation=0, fontsize=14)
ax2 = fig.add_subplot(2, 1, 2)
ax2.bar(d2.origin_country, d2.passengers, label='Passengers', color=my_colors2)
#ax2.legend(fontsize=14)
ax2.legend(handles=[Above, At, Below], fontsize=14)
plt.axhline(d2.passengers.mean(), color='black', linestyle='solid')
ax2.spines['right'].set_visible(False)
ax2.spines['top'].set_visible(False)
#ax2.axes.xaxis.set_visible(False)
ax2.set_title('Top ' + str(top2) + ' Countries', size=20)
ax2.text(top1-1, d2.passengers.mean()+5, 'Mean = ' + str(round(d1.passengers.mean())), rotation=0, fontsize=14)
plt.show()
This waterfall diagram tells the most powerful story of the three years of data by showing the year over year change in total passengers. Starting from the 2018 baseline of over one billion passengers, the chart shows a modest green increase into 2019, followed by a massive red decline in 2020 due to the COVID-19 pandemic. The grand total bar on the right shows the combined passenger count across all three years. The visual contrast between the green and red bars makes it immediately clear just how severe the impact of COVID-19 was on global air travel.
#Data Preparation for Waterfall Diagram
wf_df = df_combined.groupby('year')['passengers'].sum().reset_index()
#Individual Variable Creation For Total Passenger Count for Each Year
total_2018 = wf_df[wf_df['year'] == 2018]['passengers'].values[0]
total_2019 = wf_df[wf_df['year'] == 2019]['passengers'].values[0]
total_2020 = wf_df[wf_df['year'] == 2020]['passengers'].values[0]
#Data Deviations
change_2019 = total_2019 - total_2018
change_2020 = total_2020 - total_2019
grand_total = total_2018 + total_2019 + total_2020
wf_df['deviation'] = [0, change_2019, change_2020]
wf_df['year'] = wf_df['year'].astype(str)
wf_df.loc[wf_df.index.max()+1] = ['Total', wf_df['passengers'].sum(), 0]
years = ['2018', '2019', '2020', 'Total']
wf_df['year'] = pd.Categorical(wf_df['year'], categories = years, ordered = True)
wf_df.sort_values(by='year', inplace=True)
wf_df.reset_index(inplace= True, drop = True)
import plotly.graph_objects as go
import plotly.io as pio
if wf_df.loc[3, 'deviation'] > 0:
end_color = 'black'
elif wf_df.loc[3, 'deviation'] < 0:
end_color = 'red'
else: end_color = 'steelblue'
wf_fig = go.Figure(go.Waterfall(
name='',
orientation='v',
x=['2018', '2018 to 2019', '2019 to 2020', 'Total'],
textposition='outside',
measure=['relative', 'relative', 'relative', 'total'],
y=[total_2018/1e9, change_2019/1e9, change_2020/1e9, grand_total/1e9],
text=['{:.2f}B'.format(total_2018/1e9),
'{:.2f}B'.format(change_2019/1e9),
'{:.2f}B'.format(change_2020/1e9),
'{:.2f}B'.format(grand_total/1e9)],
decreasing={'marker':{'color':'red'}},
increasing={'marker':{'color':'green'}},
totals={'marker':{'color': end_color}}))
wf_fig.update_layout(yaxis=dict(tickformat='.1f'))
wf_fig.update_xaxes(
title_text='Years and Their Recorded Deviation From Prior Year Total Passenger Count Starting From 2018',
title_font={'size': 18}
)
wf_fig.update_yaxes(
title_text='Total Passengers (Billions)',
title_font={'size': 18},
dtick=0.5,
ticksuffix='B',
zeroline=True
)
wf_fig.update_layout(
title=dict(
text='Total Passenger Traffic 2018-2020: Impact of COVID-19 on Air Travel (Waterfall Diagram)<br>' +
'Increases in overall passenger traffic appear in Green, Declines appear in Red',
font=dict(family='Arial', size=18, color='black')
),
template='simple_white',
title_x=0.5,
showlegend=False,
autosize=True,
height=800,
width=1400,
margin=dict(l=100, r=100, t=250, b=100)
)