Flight Delays Python Data Visualization

Introduction

One of the most common nuisances for plane passengers are flight delays. Flight delays are very common among all airlines almost year-round. This visualization is going to investigate trends in flight delays during 2015 for multiple airlines through the year.

Dataset

This dataset of recorded flight delays has 31 total columns that list out flight delays throughout 2015. The main columns used for this dataset are the Airline codes, Departure Delay time, Arrival Delay time, and Dates of the delay recorded. The Departure delay times and Arrival delay times are all recorded in minutes and there are portions in the code where the times have converted to hours in certain visualizations. Given the lack of data for the US airline, it was omitted from all visualizations.

import os

os.environ['QT_QPA_PLATFORM_PLUGIN_PATH'] = 'C:/Users/Leslie Chiang/Anaconda3/Library/plugins/platforms'

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import warnings
import calendar
from matplotlib.ticker import FuncFormatter

warnings.filterwarnings("ignore")


filename = 'L:\\Leslie Chiang\\Documents\\Grad school\\DS736\\Python Project\\flight delays.csv'
df = pd.read_csv(filename)
#drop columns
df = df.drop(['TAXI_OUT', 'WHEELS_OFF', 'WHEELS_ON', 'TAXI_IN', 'CANCELLATION_REASON', 'AIR_SYSTEM_DELAY', 'SECURITY_DELAY', 'AIRLINE_DELAY',
       'LATE_AIRCRAFT_DELAY', 'WEATHER_DELAY'], axis=1)

Findings

Scatterplot

Before looking into delay patterns in the data, we want to get a big picture idea of exactly how many flights there were in total throughout the year for all airlines. This would help us better understand the percentage of delays for each airline in comparison to how many flights they have in total. The scatterplot below illustrates the total flights each airline flies throughout the months in the year. We can see clearly that airlines WN, DL, and AA seem to consistently have the most flights throughout the year. Another observation is that the number of flights seem to be consistent throughout the year regardless of the months.

#get dataset for necessary columns for scatterplot

df2 = df[['MONTH', 'AIRLINE', 'CANCELLED']].copy()

       
df2 = df2[df2['CANCELLED']  !=1] 
x = df2.groupby(['MONTH', 'AIRLINE'])['AIRLINE'].count().reset_index(name = 'count')
x = pd.DataFrame(x)
x['count_hundreds'] = round(x['count']/100,0)


omit = ['US'] #omitting US since there is not enough data for this airline
scat_df = x.loc[~x['AIRLINE'].isin(omit)]

#create scatterplot
plt.figure(figsize = (18,10))
plt.scatter(scat_df['MONTH'], scat_df['AIRLINE'], marker = '8', cmap = 'viridis',
           c = scat_df['count_hundreds'], s = scat_df['count_hundreds'], edgecolors = 'black')
plt.title('Total Flight per Airline by Months of the Year', fontsize = 18)
plt.xlabel('Months of the Year', fontsize = 14)
plt.ylabel('Airline', fontsize = 14)
cbar = plt.colorbar()
cbar.set_label('Number of Flights', rotation = 270, fontsize = 14, color = 'black', labelpad = 30)
my_colorbar_ticks = [*range(100, int(scat_df['count_hundreds'].max()), 100 )]
cbar.set_ticks(my_colorbar_ticks)
my_colorbar_tick_labels = [*range(10000, int(scat_df['count'].max()), 10000)]
my_colorbar_tick_labels = [  '{:,}'.format(each) for each in my_colorbar_tick_labels]
cbar.set_ticklabels(my_colorbar_tick_labels)
my_x_ticks = [*range(scat_df['MONTH'].min(), scat_df['MONTH'].max()+1, 1)]
plt.xticks(my_x_ticks, fontsize = 14, color = 'black')

plt.show()

Multiple Line Plot

The next visualization shown is illustrating the trend of total departure delays in hours over the course of the year. Each airline is shown with the number of flights delayed in each month. We can see that the number of flights each airline provides is pretty much proportional to the frequency of delays as a result, which is to be expected. Note that from this visualization we can view which month during the year had the most delays. The months June and July seem to peak with the most delays during that time period. January and December also have a high influx of delays.

#get dataframe for line plot
line_df = df[['DEPARTURE_DELAY', 'MONTH', 'AIRLINE']].copy()
line_df = line_df.groupby(['MONTH', 'AIRLINE'])['DEPARTURE_DELAY'].sum().reset_index(name='Total_Departure_Delay') #get total time and convert to hours

line_df['TotalHours'] = line_df['Total_Departure_Delay'] / 60
line_df['MONTH'] = line_df['MONTH'].astype(int)
line_df = line_df.loc[~line_df['AIRLINE'].isin(omit)]  # take out US

#plot the multiple line plot

fig = plt.figure(figsize=(20, 12))
ax = fig.add_subplot(1, 1, 1)

for key, grp in line_df.groupby(['AIRLINE']):
    grp.plot(ax=ax, kind='line', x='MONTH', y='TotalHours', label=key, marker='8')

plt.title('Total Departure Delays (in Hours) by Month', fontsize=18)
plt.xticks(np.arange(1, 13))

y_ticks = ax.get_yticks()
formatted_y_ticks = ["{:,}".format(int(y)) for y in y_ticks]
ax.set_yticklabels(formatted_y_ticks)
ax.set_ylabel('Total Departure Delays (in Hours)', fontsize=18, labelpad=20)
plt.show()

Dual Axis Bar

For this visualization, the top 10 airlines with the highest average of arrival delay times were taken. The dual axis plot takes these airlines and compares the average departure delays and average arrival delays between them. It is shown that departure delay average times are generally greater than arrival delay average times. Some interesting points to note are that despite WN having the most flights, NK and UA have one of the highest average departure delays. Another noteworthy observation is that NK and F9 airlines have very similar average arrival and delays times when compared to the other airlines.

#create dataframe for the bar graphs
d3 = df[['ARRIVAL_DELAY','DEPARTURE_DELAY', 'AIRLINE']].copy()
bar = d3.groupby('AIRLINE')[['ARRIVAL_DELAY', 'DEPARTURE_DELAY']].mean().reset_index()
bar = bar.loc[~bar['AIRLINE'].isin(omit)] #take out US
top_bar = bar.nlargest(10, 'ARRIVAL_DELAY') #get airlines with highest arrival delays

def autolabel(these_bars, this_ax, place_of_decimals, symbol):
    for each_bar in these_bars:
        height = each_bar.get_height()
        this_ax.text(each_bar.get_x() + each_bar.get_width()/2, height*1.01, symbol+format(height, place_of_decimals),
                     fontsize=11, color = 'black', ha = 'center', va = 'bottom')

#create dual axis bar graph
fig = plt.figure(figsize=(18, 10))
ax1 = fig.add_subplot(1, 1, 1)
ax2 = ax1.twinx()
bar_width = 0.4

x_pos = np.arange(len(top_bar)) +.2

arrive_bar = ax1.bar(x_pos - (0.5 * bar_width), top_bar['ARRIVAL_DELAY'], bar_width, color='green', edgecolor='black', label='AVG Arrival Delay')
delay_bar = ax2.bar(x_pos + (0.5 * bar_width), top_bar['DEPARTURE_DELAY'], bar_width, color='blue', edgecolor='black', label='AVG Departure Delay')


ax1.set_xticks(x_pos)
ax1.set_xticklabels(top_bar['AIRLINE'], rotation=45, fontsize=18)

ax1.set_xlabel('Airline', fontsize = 18)
ax1.set_ylabel('Average Arrival Delay (in mintues)', fontsize =18, labelpad =20)
ax2.set_ylabel('Average Departure Delay(in minutes)', fontsize =18, rotation = 270, labelpad =20)
ax1.tick_params(axis ='y', labelsize =14)
ax2.tick_params(axis ='y', labelsize =14)

plt.title('Average Arrival Delay and Average Departure Delay Times\n Top 10 Airlines', fontsize=18)


count_color, count_label = ax1.get_legend_handles_labels()
fine_color, fine_label = ax2.get_legend_handles_labels()
legend = ax1.legend(count_color + fine_color, count_label + fine_label, loc = 'upper right', frameon=True, ncol =1, shadow = True,
                   borderpad =1, fontsize =14)

# Set the same y-axis limits for both axes
ax1.set_ylim(0, 17)

ax2.set_ylim(0, 17)

autolabel(arrive_bar, ax1, '.0f', '')
autolabel(delay_bar, ax1, '.0f', '')
plt.show()

Pie Chart

The next visualization is used to illustrate the distribution of hours throughout the year. The pie chart shows the year split into quarters and the number and percentage of total hours delayed for each quarter. We can see a fairly even distribution throughout the quarters in the year 2015. Upon further investigation on the months, we the same as in the line plot that June and July make up a greater percentage slightly than other months.

#create dataframe fro pie chart
pie_df = df[['MONTH', 'DEPARTURE_DELAY']].copy()

pie_df = pie_df.groupby(['MONTH'])['DEPARTURE_DELAY'].sum().reset_index(name = 'Total_Departure_Delay')

pie_df['TotalHours'] = pie_df.Total_Departure_Delay/60
pie_df['MONTH'] = pie_df['MONTH'].astype(int)

#split the months into the 4 quarters
Q1 = [1,2,3]
Q2 = [4,5,6]
Q3 = [7,8,9]
Q4 = [10,11,12]
pie_df['Quarter'] = pie_df.MONTH
pie_df['Quarter'][pie_df['MONTH'].isin(Q1)] = "Quarter 1"
pie_df['Quarter'][pie_df['MONTH'].isin(Q2)] = "Quarter 2"
pie_df['Quarter'][pie_df['MONTH'].isin(Q3)] = "Quarter 3"
pie_df['Quarter'][pie_df['MONTH'].isin(Q4)] = "Quarter 4"
pie_df = pie_df.groupby(['Quarter','MONTH'])['TotalHours'].sum().reset_index(name = 'TotalHours')
pie_df['MonthName'] = pie_df['MONTH'].apply(lambda x: calendar.month_abbr[x])


#create pie chart
number_outside_colors = len(pie_df.Quarter.unique())
outside_color_ref_number = np.arange(number_outside_colors)*4

number_inside_colors = len(pie_df.MonthName.unique())
all_color_ref_number = np.arange(number_outside_colors + number_inside_colors)

inside_color_ref_number = []
for each in all_color_ref_number: 
    if each not in outside_color_ref_number:
        inside_color_ref_number.append(each)
fig = plt.figure(figsize = (10,10))
ax = fig.add_subplot(1,1,1)
colormap = plt.get_cmap("tab20c")
outer_colors = colormap(outside_color_ref_number)

all_hours = pie_df.TotalHours.sum()

pie_df.groupby(['Quarter'])['TotalHours'].sum().plot(
    kind = 'pie', radius = 1, colors = outer_colors, pctdistance = 0.85, labeldistance = 1.1,
    wedgeprops = dict(edgecolor = 'w'), textprops = {'fontsize':14},
    autopct = lambda p: '{:.2f}%\n ({:.1f}hrs)'.format(p, (p/100)*all_hours/1e+3),
    startangle = 90)

inner_colors = colormap(inside_color_ref_number)
pie_df.TotalHours.plot(
    kind = 'pie', radius = 0.7, colors = inner_colors, pctdistance = 0.55, labeldistance = 0.8,
    wedgeprops = dict(edgecolor = 'w'), textprops = {'fontsize':13},
    labels = pie_df.MonthName,
    autopct = '%1.2f%%',
    startangle = 90)

hole = plt.Circle((0,0), 0.3, fc = 'white')
fig1 = plt.gcf()
fig1.gca().add_artist(hole)


ax.yaxis.set_visible(False)
plt.title('Total Hours Delayed by Quarter and Month (in Thousands)', fontsize = 18)

ax.text(0,0, 'Total Hours\n' + str(round(all_hours/1e3, 2)) + 'hrs', size =18, ha = 'center', va = 'center')

ax.axis('equal')

plt.tight_layout()


plt.show()

Bump Chart

This bump chart shows the how the different airlines rank in delayed flights for each month. Again, we can see the rise in airline flight delays within the months of June and July. Another interesting trend to note is January starts off as high then slowly descends through the spring months. The trend starts to also pick back up in high average delay times during November and December which could be due to holiday travels.


#Create bump dataframe then transpose
bump_df = df.groupby(['AIRLINE', 'MONTH'])['DEPARTURE_DELAY'].mean().reset_index(name='AvgDelay')
bump_df= bump_df.loc[~bump_df['AIRLINE'].isin(omit)] #take out US
bump_df['MonthName'] = bump_df['MONTH'].apply(lambda x: calendar.month_abbr[x])
bump_df = bump_df.pivot(index='AIRLINE', columns='MonthName', values='AvgDelay')

month_order = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
bump_df = bump_df.reindex(columns = month_order)
bump_df_ranked = bump_df.rank(0, ascending=False, method='min')


bump_df_ranked = bump_df_ranked.T


#create bump chart
fig5 = plt.figure(figsize=(23,20))
ax = fig5.add_subplot(1,1,1)

bump_df_ranked.plot(kind='line', ax=ax, marker='o', markeredgewidth=1, linewidth=6,
                    markersize=35,
                    markerfacecolor='white')
ax.invert_yaxis()

ax.set_xlabel('Airline', fontsize=18)


num_rows = bump_df_ranked.shape[0]
num_cols = bump_df_ranked.shape[1]

plt.ylabel('Monthly Ranking', fontsize=18, labelpad=10)
plt.xlabel('Months', fontsize=18, labelpad=10)
plt.title('Ranking of Average Time Delay by Months \n and Airline in 2015', fontsize=18, pad=15)
plt.yticks(np.arange(1, num_cols+1,1))

plt.xticks(np.arange(num_rows), month_order, fontsize = 14)

plt.legend(bbox_to_anchor=(1.01,1.01),fontsize=14,
           labelspacing = 1,
           markerscale = .4,
           borderpad = 1,
           handletextpad = 0.8)
i = 0
j = 0
for eachcol in bump_df_ranked.columns:
    for eachrow in bump_df_ranked.index:
        this_rank = bump_df_ranked.iloc[i,j]
        ax.text(i, this_rank, str(round(bump_df.iloc[j,i],2)), ha='center', va='center', fontsize=12)
        i+=1
    j+=1
    i=0


plt.show()

Conclusion

The flight delays dataset for the year 2015 provided a good summary of the nature of delayed flights during the months between the different airlines. First, it was good to understand how the number of flights total differed between the different airlines to understand the proportion of delays. Afterwards, we were able to look into how the total and average delays times for both departure and arrivals varied between the airlines as well as how it differed between the months. We can conclude from these visualizations that the summer months as well as months close to holidays will generally see an increase in flight delays. Additionally, airlines can use this dataset to better prepare them for those increase in flights during those months as well as get a good comparison to how they are performing with other airline companies.