Air travel has been one of the biggest changes in the way people travel in the past 100 years. Instead of taking days to cross the country, you can eat breakfast in New York and lunch in Los Angeles. As air travel became mainstream and available to everyday travelers, airports became busier and busier, and this increase in traffic leads to delays. This data set covers various aspects of delayed flights during 2015. The main aspects I look at are the carriers, the tail numbers of the planes, the total delay time, and the dates of the flights.
The flight delay data set is 564.94 MB and has 31 columns. It includes information about delayed flights during 2015. The main columns I used were the airline carrier codes, the aircraft tail numbers, the departure delays, and the date columns. I totaled the departure delays and divided by 60 to express them in hours, which makes the visualizations easier to understand. One issue I encountered was the way the dates were presented. The Year, Month, and Day of the Week are stored in separate columns, so grouping the data into buckets such as Quarter 1 required extra manipulation. Another issue is that the departure delay is recorded as an integer number of minutes rather than in an hour, minute, second format. This did not hinder the visualizations, but it is something to be aware of when working with the data.
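The minutes-to-hours conversion and the date handling described above can be sketched on a few made-up rows. The column names `YEAR`, `MONTH`, and `DEPARTURE_DELAY` follow the code in this report; the `DAY` column, the derived `FLIGHT_DATE` and `QUARTER` columns, and all of the values are assumptions for illustration only.

```python
import pandas as pd

# Made-up rows that mimic the layout of the flight data (values are illustrative).
sample = pd.DataFrame({
    "YEAR": [2015, 2015, 2015, 2015],
    "MONTH": [1, 4, 6, 12],
    "DAY": [15, 2, 30, 24],                 # assumed day-of-month column
    "DEPARTURE_DELAY": [30, 90, -5, 120],   # integer minutes; negative = early
})

# Combine the separate date columns into one datetime column.
# pd.to_datetime accepts a frame whose columns are named year, month, day.
parts = sample[["YEAR", "MONTH", "DAY"]].rename(columns=str.lower)
sample["FLIGHT_DATE"] = pd.to_datetime(parts)

# Derive the calendar quarter (1 to 4) from the date.
sample["QUARTER"] = sample["FLIGHT_DATE"].dt.quarter

# Total the delay minutes per quarter and convert to hours.
hours_by_quarter = sample.groupby("QUARTER")["DEPARTURE_DELAY"].sum() / 60
print(hours_by_quarter)
```

Once a real date column exists, buckets such as quarters or weeks come from the `.dt` accessor instead of hand-written month lists.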
This visualization looks at the major airline carriers in the United States and their delayed flights during the year. The delays are broken down by month to show which airlines' flights are delayed and in which months those delays occur. Southwest Airlines, whose airline code is WN, had the most delayed flights in every month of the year. Southwest uses the point-to-point model instead of the more popular hub-and-spoke model. This has been a hit with many flyers, as Southwest offers many cheap routes between smaller-market airports. The major drawback of this style of flight distribution is that if one plane is late due to weather or labor shortages, it causes a ripple effect that cannot be easily fixed. This is a likely reason for the high number of delayed flights compared to the other airlines, which use hub and spoke.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
path = "C:/Users/Michael/Documents/Grad loyola fall 2021/GB 736/"
filename = path + "flight delays.csv"
df = pd.read_csv(filename, usecols = ['MONTH', 'YEAR', 'AIRLINE'])
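# Drop rows with an invalid month value.
df = df[df['MONTH'] != 0]
# Count the rows for each month and airline.
x = df.groupby(['MONTH', 'AIRLINE'])['AIRLINE'].count().reset_index(name = 'count')
# Scale the counts down so they can double as marker sizes.
x['count_hundreds'] = round(x['count']/100, 0)
# Leave the carrier with code US out of the chart.
omit = ['US']
x2 = x.loc[~x['AIRLINE'].isin(omit)]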
plt.figure(figsize = (18,10))
plt.scatter(x2['MONTH'], x2['AIRLINE'], marker = '8', cmap = 'viridis',
c = x2['count_hundreds'], s = x2['count_hundreds'], edgecolors = 'black')
plt.title('Delayed Flights by Airline by Month of the Year', fontsize = 18)
plt.xlabel('Months of the Year', fontsize = 14)
plt.ylabel('Airline', fontsize = 14)
cbar = plt.colorbar()
cbar.set_label('Number of Delayed Flights', rotation = 270, fontsize = 14, color = 'black', labelpad = 30)
# Place a tick every 100 units (10,000 flights) and label each tick with the real count.
my_colorbar_ticks = [*range(100, int(x2['count_hundreds'].max()), 100)]
cbar.set_ticks(my_colorbar_ticks)
# Build the labels from the ticks so the two lists always have the same length.
my_colorbar_tick_labels = ['{:,}'.format(each * 100) for each in my_colorbar_ticks]
cbar.set_ticklabels(my_colorbar_tick_labels)
my_x_ticks = [*range(x2['MONTH'].min(), x2['MONTH'].max()+1, 1)]
plt.xticks(my_x_ticks, fontsize = 14, color = 'black')
plt.show()
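The chart above counts every row for each airline and month. If the file also holds flights that left on time, one way to count only delayed departures is to filter first. This is a minimal sketch on made-up rows, assuming a flight counts as delayed when `DEPARTURE_DELAY` is greater than 0; the threshold and the `count_delayed` helper are hypothetical and not part of the data set.

```python
import pandas as pd

def count_delayed(flights, threshold_minutes=0):
    """Count flights per (MONTH, AIRLINE) whose departure delay exceeds the threshold."""
    delayed = flights[flights["DEPARTURE_DELAY"] > threshold_minutes]
    return (delayed.groupby(["MONTH", "AIRLINE"])
                   .size()
                   .reset_index(name="count"))

# Made-up rows for illustration only.
flights = pd.DataFrame({
    "MONTH": [1, 1, 1, 2, 2],
    "AIRLINE": ["WN", "WN", "DL", "WN", "DL"],
    "DEPARTURE_DELAY": [12, -3, 45, 7, None],  # None = delay not recorded
})

delayed_counts = count_delayed(flights)
print(delayed_counts)
```

Rows with a missing delay compare as false against the threshold, so they drop out without a separate cleaning step.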
This visualization shows the number of times a given plane is delayed. The aircraft with the most delayed flights may be showing their age, which could signal to the airline that it is time to retire a plane from the fleet because its unreliability is costing the airline money. The top ten aircraft with the most delays are all roughly 20-year-old Boeing 717s. The 717 was first produced in 1998, which means these planes are almost as old as the first 717. Older planes tend to have more mechanical issues, which could cause delays. All of the top ten planes are operated by Hawaiian Airlines, which could also explain some delays, as tropical storms can prevent air travel. Whatever the reason for each delay, the overall trend is that the planes still in use today are older, which tends to increase the number of delays.
df2 = pd.read_csv(filename, usecols = ['TAIL_NUMBER', 'DEPARTURE_DELAY'])
# Check how many values are missing in each column.
df2.isna().sum()
# Drop flights with no recorded departure delay and label missing tail numbers.
df2 = df2.dropna(subset = ['DEPARTURE_DELAY']).copy()
df2['TAIL_NUMBER'] = df2['TAIL_NUMBER'].fillna("Not Available")
df2['DEPARTURE_DELAY'] = pd.to_numeric(df2['DEPARTURE_DELAY'])
x3 = df2.groupby(['TAIL_NUMBER']).agg({'TAIL_NUMBER':['count'], 'DEPARTURE_DELAY':['sum', 'mean']}).reset_index()
x3.columns = ['Tail_Number', 'Count', 'TotalDelay', 'AvgDelay']
x3 = x3.sort_values('Count', ascending = False)
x3.reset_index(inplace = True, drop = True)
def pick_colors_according_to_mean_count(this_data):
    # Color each bar by how its count compares with the mean count (1% tolerance).
    colors = []
    avg = this_data.Count.mean()
    for each in this_data.Count:
        if each > avg*1.01:
            colors.append('lightcoral')
        elif each < avg*0.99:
            colors.append('green')
        else:
            colors.append('black')
    return colors
import matplotlib.patches as mpatches
bottom1 = 1
top1 = 250
d1 = x3.loc[bottom1:top1]
my_colors1 = pick_colors_according_to_mean_count(d1)
my_colors1
bottom2 = 1
top2 = 10
d2 = x3.loc[bottom2:top2]
my_colors2 = pick_colors_according_to_mean_count(d2)
my_colors2
Above = mpatches.Patch(color = 'lightcoral', label = 'Above Average')
At = mpatches.Patch(color = 'black', label = 'Within 1% of the Average')
Below = mpatches.Patch(color = 'green', label = 'Below Average')
fig = plt.figure(figsize = (18,16))
fig.suptitle('Frequency of Flight Delays by Aircraft Tail Number: \n Top ' + str(top1) + ' and Top ' +str(top2),
fontsize = 18, fontweight = 'bold')
ax1 = fig.add_subplot(2,1,1)
ax1.bar(d1.Tail_Number, d1.Count, label = 'Count', color = my_colors1)
ax1.legend(handles = [Above, At, Below], fontsize =14)
ax1.axhline(d1.Count.mean(), color = 'black', linestyle = 'dashed')
ax1.spines['right'].set_visible(False)
ax1.spines['top'].set_visible(False)
ax1.axes.xaxis.set_visible(False)
ax1.set_title('Top ' + str(top1) +' Aircraft Tail Numbers', size = 20)
ax1.text(top1-10, d1.Count.mean()+50, 'Mean = ' + str(round(d1.Count.mean(), 1)), rotation = 0, fontsize = 14 )
ax2 = fig.add_subplot(2, 1, 2)
ax2.bar(d2.Tail_Number, d2.Count, label = 'Count', color=my_colors2)
ax2.legend(handles = [Above, At, Below], fontsize =14)
ax2.axhline(d2.Count.mean(), color = 'black', linestyle = 'dashed')
ax2.spines['right'].set_visible(False)
ax2.spines['top'].set_visible(False)
#ax2.axes.xaxis.set_visible(False)
ax2.set_title('Top ' + str(top2) +' Aircraft Tail Numbers', size = 20)
ax2.text(top2-1, d2.Count.mean()+50, 'Mean = ' + str(round(d2.Count.mean(), 1)), rotation = 0, fontsize = 14 )
fig.subplots_adjust(hspace = 0.45)
plt.show()
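The chart above ranks aircraft by how many rows they have in the file. A complementary view, sketched here on made-up rows, ranks tail numbers by average delay while ignoring aircraft with too few flights. The `summarize_by_tail` helper, the `min_flights` cutoff, and the tail numbers in the sample are hypothetical.

```python
import pandas as pd

def summarize_by_tail(flights, min_flights=2):
    """Per-aircraft flight count, total delay, and mean delay, sorted by mean delay."""
    summary = (flights.dropna(subset=["TAIL_NUMBER", "DEPARTURE_DELAY"])
                      .groupby("TAIL_NUMBER")["DEPARTURE_DELAY"]
                      .agg(Count="count", TotalDelay="sum", AvgDelay="mean")
                      .reset_index())
    # Drop aircraft with too few flights for the mean to be meaningful.
    summary = summary[summary["Count"] >= min_flights]
    return summary.sort_values("AvgDelay", ascending=False).reset_index(drop=True)

# Made-up rows for illustration only.
flights = pd.DataFrame({
    "TAIL_NUMBER": ["N100", "N100", "N200", "N200", "N200", "N300", None],
    "DEPARTURE_DELAY": [10, 30, 5, 5, 20, 120, 15],  # minutes
})

ranked = summarize_by_tail(flights)
print(ranked)
```

The named aggregation produces the same three columns as the `Count`, `TotalDelay`, and `AvgDelay` frame built earlier, without the manual column renaming.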
In this visualization, I look at the total hours of delayed flights in each month of 2015. Understanding when air travel peaks and when flights are delayed can help flyers avoid those periods and help companies see where they can improve to reduce delays during peak travel times. The Mondays and Tuesdays in June have the highest total delay time. June is the peak month for travel because summer vacation has just begun, and Monday and Tuesday are good travel days for avoiding high weekend prices and for the week-long vacations most families try to take during the summer. As demand for air travel increases, the airlines might have to overwork planes to meet it, causing some older planes to have mechanical issues more frequently. December also has a fair number of delays, which follows the same increased-travel trend seen in June.
line_df = pd.read_csv(filename, usecols = ['MONTH', 'DAY_OF_WEEK', 'DEPARTURE_DELAY'])
line_df = line_df.groupby(['MONTH', 'DAY_OF_WEEK'])['DEPARTURE_DELAY'].sum().reset_index(name = 'Total_Departure_Delay')
line_df['TotalHours'] = line_df.Total_Departure_Delay/60
line_df['MONTH'] = line_df['MONTH'].astype(int)
line_df['DAY_OF_WEEK'] = line_df['DAY_OF_WEEK'].astype(int)
from matplotlib.ticker import FuncFormatter
fig = plt.figure(figsize = (18,10))
ax = fig.add_subplot(1,1,1)
my_colors = {1 : 'blue',
             2 : 'red',
             3 : 'green',
             4 : 'gray',
             5 : 'purple',
             6 : 'gold',
             7 : 'brown'}
# Group by a single column name (not a list) so that key is a plain integer
# that can be looked up in my_colors.
for key, grp in line_df.groupby('DAY_OF_WEEK'):
    grp.plot(ax=ax, kind = 'line', x = 'MONTH', y = 'TotalHours', color = my_colors[key], label = key, marker = '8')
plt.title('Total Departure Delays (in Hours) by Month', fontsize = 18)
plt.xticks(np.arange(1,13))
ax.set_xticklabels(['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])
ax.set_xlabel('Months', fontsize = 18)
ax.set_ylabel('Total Departure Delays (in Hours)', fontsize = 18, labelpad = 20)
ax.tick_params(axis = 'x', labelsize = 14, rotation = 0)
my_labels = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
ax.legend(labels = my_labels)
ax.yaxis.set_major_formatter(FuncFormatter(lambda x, pos:('%1.1f')%(x*1e-0)))
plt.show()
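The same month-by-weekday totals can be laid out as a table, which makes the peak cell easy to read off. This is a minimal sketch on made-up rows, assuming `DAY_OF_WEEK` uses 1 for Monday through 7 for Sunday, as the legend above does.

```python
import pandas as pd

# Made-up rows for illustration only.
flights = pd.DataFrame({
    "MONTH": [6, 6, 6, 12, 12],
    "DAY_OF_WEEK": [1, 1, 2, 1, 5],
    "DEPARTURE_DELAY": [60, 120, 90, 30, 45],  # minutes
})

# Total delay per (month, weekday), converted from minutes to hours.
totals = flights.groupby(["MONTH", "DAY_OF_WEEK"])["DEPARTURE_DELAY"].sum() / 60

# idxmax on the two-level index returns the (month, weekday) pair with the most hours.
peak_month, peak_day = totals.idxmax()

# Spread the weekdays across the columns; combinations with no flights become 0.
table = totals.unstack(fill_value=0)
print(table)
print("Peak:", peak_month, peak_day)
```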
This visualization, like the line chart, shows the total delayed flight hours, but with a twist: it breaks the hours down by quarter and then further by month. This lets the user see which quarter of the year has the most or fewest hours of flight delays, along with how the months break down within each quarter. Quarter 2 leads the way with 254.8 thousand hours of flight delays, roughly 28.5% of the total. June leads all months with nearly 13% of the year's delay hours, almost 3 percentage points more than any other month. July is the next closest, with December not far behind.
# Map each month to its quarter label.
quarters = {"Quarter 1": [1,2,3],
            "Quarter 2": [4,5,6],
            "Quarter 3": [7,8,9],
            "Quarter 4": [10,11,12]}
line_df['Quarter'] = ""
for quarter_name, months in quarters.items():
    # .loc assigns on line_df itself and avoids chained-assignment problems.
    line_df.loc[line_df['MONTH'].isin(months), 'Quarter'] = quarter_name
pie_df = line_df.groupby(['Quarter','MONTH'])['TotalHours'].sum().reset_index(name = 'TotalHours')
number_outside_colors = len(pie_df.Quarter.unique())
outside_color_ref_number = np.arange(number_outside_colors)*4
number_inside_colors = len(pie_df.MONTH.unique())
all_color_ref_number = np.arange(number_outside_colors + number_inside_colors)
inside_color_ref_number = []
for each in all_color_ref_number:
    if each not in outside_color_ref_number:
        inside_color_ref_number.append(each)
fig = plt.figure(figsize = (10,10))
ax = fig.add_subplot(1,1,1)
colormap = plt.get_cmap("tab20c")
outer_colors = colormap(outside_color_ref_number)
all_hours = pie_df.TotalHours.sum()
pie_df.groupby(['Quarter'])['TotalHours'].sum().plot(
kind = 'pie', radius = 1, colors = outer_colors, pctdistance = 0.85, labeldistance = 1.1,
wedgeprops = dict(edgecolor = 'w'), textprops = {'fontsize':14},
autopct = lambda p: '{:.2f}%\n ({:.1f}hrs)'.format(p, (p/100)*all_hours/1e+3),
startangle = 90)
inner_colors = colormap(inside_color_ref_number)
pie_df.TotalHours.plot(
kind = 'pie', radius = 0.7, colors = inner_colors, pctdistance = 0.55, labeldistance = 0.8,
wedgeprops = dict(edgecolor = 'w'), textprops = {'fontsize':13},
labels = pie_df.MONTH,
autopct = '%1.2f%%',
startangle = 90)
hole = plt.Circle((0,0), 0.3, fc = 'white')
fig1 = plt.gcf()
fig1.gca().add_artist(hole)
ax.yaxis.set_visible(False)
plt.title('Total Hours Delayed by Quarter and Month (in Thousands)', fontsize = 18)
ax.text(0,0, 'Total Hours\n' + str(round(all_hours/1e3, 2)) + 'hrs', size =18, ha = 'center', va = 'center')
ax.axis('equal')
plt.tight_layout()
plt.show()
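The percentages printed on the two rings are both shares of the yearly total, because each ring sums to the same number of hours. A short sketch of that arithmetic on made-up monthly totals (the numbers are illustrative, not results from the data set):

```python
import pandas as pd

# Made-up total delay hours per month (illustrative values only).
monthly = pd.DataFrame({
    "MONTH": range(1, 13),
    "TotalHours": [80, 70, 75, 60, 65, 125, 100, 85, 50, 55, 60, 95],
})

# Integer arithmetic maps months 1-3 to quarter 1, 4-6 to quarter 2, and so on.
monthly["Quarter"] = (monthly["MONTH"] - 1) // 3 + 1

year_total = monthly["TotalHours"].sum()

# Inner ring: each month's share of the year.
monthly["PctOfYear"] = 100 * monthly["TotalHours"] / year_total

# Outer ring: each quarter's share of the year.
quarter_pct = 100 * monthly.groupby("Quarter")["TotalHours"].sum() / year_total
print(quarter_pct.round(2))
```

Since both rings use the same denominator, the month percentages inside a quarter add up to that quarter's percentage.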
This bump chart shows how the months rank in delayed flight hours on each day of the week. Not surprisingly, June and July are near the top of the chart every day of the week. A surprising riser during the week is January, which makes a steady climb during the latter half of the week and into the weekend. Bad weather in January, along with the end of holiday travel, could account for this increase in delayed flight hours toward the weekend.
bump_df = line_df.groupby(['MONTH', 'DAY_OF_WEEK'])['TotalHours'].sum().reset_index(name = 'Total Hours')
bump_df = bump_df.pivot(index = 'MONTH', columns = 'DAY_OF_WEEK', values = 'Total Hours')
bump_df_ranked = bump_df.rank(0, ascending = False, method = 'min')
bump_df_ranked = bump_df_ranked.T
fig = plt.figure(figsize=(20,10))
ax = fig.add_subplot(1, 1, 1)
my_colors = {1 : 'blue',
2 : 'red',
3 :'green',
4 :'gray',
5 :'purple',
6 :'hotpink',
7 :'brown',
8 :'yellow',
9 :'black',
10 : 'maroon',
11 : 'aqua',
12 : 'orange'}
# The color dict is keyed by month number, which matches the column names.
bump_df_ranked.plot(kind = 'line', ax= ax, marker = 'o', markeredgewidth = 1, linewidth = 6,
                    markersize = 18,
                    markerfacecolor = 'white',
                    color = my_colors)
ax.invert_yaxis()
num_rows = bump_df_ranked.shape[0]
num_cols = bump_df_ranked.shape[1]
plt.ylabel('Ranking', fontsize = 18, labelpad = 10)
plt.title('Ranking of Total Hours Delayed by Month and by Day \n Bump Chart', fontsize = 18, pad = 15)
plt.yticks(range(1, num_cols + 1, 1), fontsize = 14)
ax.set_xlabel('Days of the Week', fontsize = 18)
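# Pin one tick to each day (1 through 7) so the labels cannot drift out of alignment.
ax.set_xticks(range(1, num_rows + 1))
ax.set_xticklabels(['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'])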
my_labels = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
ax.legend(labels = my_labels, bbox_to_anchor = (1.01, 1.01), fontsize = 14,
labelspacing = 1,
markerscale = 1,
borderpad = 1,
handletextpad = 0.8)
plt.show()
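The ranking step behind the bump chart can be seen on a small made-up table. With `ascending = False`, the month with the most delay hours on a given day receives rank 1, and `method = 'min'` gives tied months the same, lower rank. The values below are illustrative only.

```python
import pandas as pd

# Rows are months, columns are days of the week; values are made-up delay hours.
hours = pd.DataFrame(
    {1: [300.0, 500.0, 500.0], 2: [250.0, 400.0, 100.0]},
    index=pd.Index([1, 6, 7], name="MONTH"),
)

# Rank down each column (axis 0): the largest value gets rank 1.
ranked = hours.rank(axis=0, ascending=False, method="min")
print(ranked)
```

On day 1 the two tied months share rank 1 and the remaining month drops to rank 3, which is why a bump chart can show overlapping markers and a skipped position.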
This data provides a wonderful dive into the nature of delayed flights across the months and days of the week. First, we saw which airline experienced the most delayed flights, which was Southwest. I believe their point-to-point model accounts for a majority of their delays, as planes stuck in other cities cause a ripple effect felt across many other destinations. Next, we looked at the aircraft tail numbers with the most flight delays. The top ten aircraft were all twenty-year-old planes flown by Hawaiian Airlines. This told us that as planes get older, and when they are flown in an environment prone to inclement weather, issues arise that lead to more delays. The next three visualizations looked at different aspects of total delayed flight hours. The first was a line plot showing the days of the week across the months of the year; Mondays and Tuesdays in June have the most delayed hours. The pie chart showed how the flight delays broke down by quarter. The bump chart then ranked the months on each day by total flight delay hours. A common theme throughout the last three visualizations is that delays increase during the summer months and the holiday season as the amount of travel increases.