Flights Data Set

This data set contains information about domestic flights in the United States in 2015 (from Thursday, Jan 1, 2015 through Thursday, Dec 31, 2015). The primary csv file (flights.csv) contains 5,819,079 observations of 31 variables, corresponding to 5,819,079 flights. Variables include the date and time of the flight, origin and destination airport codes, as well as information about delay time, delay reasons, cancellations, and diversions. Information regarding the arrival, departure, and delay times is given in minutes. In addition, two supplemental csv files (airlines.csv and airports.csv) provide additional information about the airlines and origin and destination airports.

Data Acquisition and Merging

The data was acquired as three csv files from: https://www.kaggle.com/datasets/usdot/flight-delays#airports.csv.

These files were read in as dataframes. I then merged the flights and airlines dataframe, as well as relevant information from the airports dataframe.

import os
os.environ['QT_QPA_PLATFORM_PLUGIN_PATH'] = "C:\\Users\\johna\\anaconda3\\Library\\plugins\\platforms"

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter
import plotly.graph_objects as go

path = "C:\\Users\\johna\\Documents\\DS\\DS 736\\flight_delays\\"

#read in csv files
#source: https://www.kaggle.com/datasets/usdot/flight-delays#airports.csv
flights = pd.read_csv(path+"flights.csv")
airlines = pd.read_csv(path+"airlines.csv")
airports = pd.read_csv(path+"airports.csv")

#merge flights and airlines dataframes
flights.rename(columns={"AIRLINE":"AIRLINE_CODE"}, inplace=True)
airlines.rename(columns={"IATA_CODE":"AIRLINE_CODE", "AIRLINE":"AIRLINE_NAME"}, inplace=True)
df = pd.merge(flights, airlines, on="AIRLINE_CODE", how="outer")

#merge new df with select columns from airports dataframe
df.rename(columns={"ORIGIN_AIRPORT":"ORIGIN_AIRPORT_CODE","DESTINATION_AIRPORT":"DESTINATION_AIRPORT_CODE"}, inplace=True)
airports.rename(columns={"IATA_CODE":"ORIGIN_AIRPORT_CODE", "AIRPORT":"ORIGIN_AIRPORT_NAME"}, inplace=True)
df = pd.merge(df, airports[["ORIGIN_AIRPORT_CODE","ORIGIN_AIRPORT_NAME", "LONGITUDE"]], on="ORIGIN_AIRPORT_CODE", how="outer")

Data Cleaning

I took three main steps to prepare the data for analysis:

Creation of 3 New Variables

I created three new variables: REGION, WEEKDAY, and MONTH.

REGION classifies the region of a flight’s origin airport on the basis of longitude. Airports at -110 degrees longitude and West are classified as “West”, airlines between -95 and -110 degrees longitude are classified as “West-Central”, airlines between -95 and -85 degrees longitude are classified as “East-Central”, and airlines East of -85 degrees longitude are classified as “East”. The West region covers the pacific coast and western mountain region. The West-Central region covers the plains states and eastern mountain region. The East-Central region covers most of the Midwest. And the East region covers the east coast and eastern parts of the Midwest.

WEEKDAY specifies the day of the week a flight departed by name (Monday, Tuesday, etc.)

MONTH specifies the month that the flight departed by name (Jan, Feb, etc.), rather than by number, as in the month variable supplied in the data set.

#create REGION variable based on longitude of origin airport
df["REGION"] = np.where(df["LONGITUDE"] <= -110,
                        "West",
                        np.where(df["LONGITUDE"] <= -95,
                                 "West-Central",
                                 np.where(df["LONGITUDE"] <= -85,
                                          "East-Central",
                                          "East")))

#order region levels from East to West
region_order = ["East", "East-Central", "West-Central", "West"] 
df["REGION"] = pd.Categorical(df["REGION"], categories = region_order, ordered=True)


#create ordered WEEKDAY variable

#mapping dictionary
weekday_dict = {1:"Monday", 2:"Tuesday", 3:"Wednesday", 4:"Thursday", 5:"Friday", 6:"Saturday", 7:"Sunday"}

df["WEEKDAY"] = df["DAY_OF_WEEK"].map(weekday_dict)

#ordering days from Monday - Sunday
day_order = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
df["WEEKDAY"] = pd.Categorical(df["WEEKDAY"], categories=day_order, ordered=True)


#created ordered MONTH name variable

#rename month variable as month number
df.rename(columns={"MONTH":"MONTH_NUM"}, inplace=True)

#create month name variable
month_dict = {1:"Jan", 2:"Feb", 3:"Mar", 4:"Apr", 5:"May", 6:"Jun", 7:"Jul", 8:"Aug", 9:"Sep", 10:"Oct", 11:"Nov", 12:"Dec"}
df["MONTH"] = df["MONTH_NUM"].map(month_dict)

#order months from Jan to Dec
month_order = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
df["MONTH"] = pd.Categorical(df["MONTH"], categories=month_order, ordered=True)

Elimination of 1 NA observation

1 observation with an NA value for “CANCELLED” was dropped. All other observations with NAs were retained as these referred to flights that were cancelled or diverted.

#remove 1 row with NA value for cancelled
df = df[df["CANCELLED"].notna()]

Creation of Dataframes

I created separate dataframes for delayed, cancelled, diverted, and “normal” (not cancelled or diverted) flights.

#create separate data frames for normal, cancelled, delayed, and diverted flights
diverted = df[(df["DIVERTED"]==1)]
cancelled = df[(df["CANCELLED"]==1)]
normal = df[(df["CANCELLED"]==0) & (df["DIVERTED"]==0)]
normal_cancelled = df[df["DIVERTED"]==0]

#dataframe of non-cancelled flights with arrival delays
delayed = normal[normal["ARRIVAL_DELAY"]>0]

Findings

Click the five tabs to see five visualizations summarizing my findings from this dataset.

Stacked Bar Chart

This stacked bar chart shows all the airlines in the data set ordered by the percentage of the airline’s flights that were delayed or cancelled.

Across all airlines in the data set, the average delay time of delayed flights was 33.11 minutes. The blue portion of the bar corresponds to the percentage of the airline’s flights that were delayed less than this average delay time of 33.11 mins; the orange portion of the bar corresponds to the percentage of the airline’s flights that were delayed more than this average delay time of 33.11 mins; and the green portion of the bar corresponds to the percentage of the airline’s flights that were cancelled.

Unsurprisingly, budget airlines like Spirit and Frontier had a higher percentage of delayed and cancelled flights. In turn, the premier domestic airlines, such as American and Delta had a lower percentage of delayed and cancelled flights. Delta had the lowest percentage of delayed and cancelled flights overall, at only 29.14%.

#create variable indicating flight's cancellation and delay status:
#values: cancelled, above-average delay, below-average delay, no delay

normal_cancelled["status"] = np.where(normal_cancelled["CANCELLED"] == 1, #if flight was cancelled
                                           "Cancelled",
                                           np.where(normal_cancelled["ARRIVAL_DELAY"] > delayed["ARRIVAL_DELAY"].mean(), #if delay exceeded average delay of delayed flights
                                                    "Above-Average Delay (> 33.11 mins)",
                                                    np.where(normal_cancelled["ARRIVAL_DELAY"] > 0, #if delayed but less than average delay of delayed flights
                                                             "Below-Average Delay (<= 33.11 mins)",
                                                             "No Delay"))) #if on time or early
                                                             
#calculate percentage of flights cancelled, above-average delayed, below-average delayed, or no delay for each airline
status_pcts = round(normal_cancelled.groupby("AIRLINE_NAME")["status"].value_counts()/normal_cancelled["AIRLINE_NAME"].value_counts()*100,2)

#convert to dataframe for plotting
status_pcts = status_pcts.unstack()

###prepare dataframe for graphing

#drop flights with no delay (on time or early flights)
status_pcts.drop(["No Delay"],axis=1, inplace=True)

#add column for rowwise sums (total number of flights delayed or cancelled for each airline)
status_pcts["Total"] = status_pcts.sum(axis=1)

#sort dataframe by rowwise sums
status_pcts.sort_values("Total", inplace=True)
#stacked bar chart
fig = plt.figure(figsize=(18,10))
ax = fig.add_subplot(1,1,1)

status_pcts[["Below-Average Delay (<= 33.11 mins)", "Above-Average Delay (> 33.11 mins)", "Cancelled"]].plot(kind="barh", stacked=True, ax=ax)

totals = status_pcts[["Below-Average Delay (<= 33.11 mins)", "Above-Average Delay (> 33.11 mins)", "Cancelled"]].sum(axis=1)
for i, total in enumerate(totals):
    ax.text(total+1, i, str(round(total,2))+"%", color="black", size=12, va="center", fontweight="bold")
plt.xlim(0, totals.max()*1.1);
    
ax.set_title("Airlines by Percentage of Flights Delayed or Cancelled\n (Average Delay across Airlines: 33.11 Minutes)", fontsize=20)
ax.set_xlabel("Percentage of Flights Delayed or Cancelled", fontsize=18)
ax.set_ylabel("Airline", fontsize=18)
plt.xticks(np.arange(start=0, stop=max(status_pcts["Total"]+5), step=5), fontsize=14);
plt.yticks(fontsize=14);
ax.legend(loc="lower right", fontsize=14)

#prevent y-axis labels and title from being cut off
plt.tight_layout()

plt.show()

Bump Chart

This bump chart ranks the top 6 most frequent carriers by their average flight delay in minutes. The top 6 most frequent carriers include the “Big 4 airlines” that carry the majority of American passengers (American, Delta, Southwest, United), as well as 2 regional airlines (Skywest and Atlantic Southeast).

The chart provides the airline’s rank in each of the 12 months of 2015. Top rank (#1) is given to the airline with the shortest average delay in minutes. A negative average delay means that the airline’s average flight was early, while an average delay of 0 means the airline’s average flight was on time.

The numbers inside the circles indicate each airline’s average delay for each month. Zero and negative numbers (on-time and early) are given in black, while positive numbers (delayed) are given in red.

As with the stacked bar chart, the chart generally indicates the superior performance of the “Big 4” airlines to smaller carriers, in particular highlighting the on-time and early performance of Delta.

#determine top 6 most frequent airlines (i.e. most flights in data set)
most_freq_airlines = normal_cancelled["AIRLINE_NAME"].value_counts().head(6)

#calculate average arrival delay (minutes) per flight by airline by month for the top 6 most frequent airlines
most_freq_df = normal_cancelled[normal_cancelled["AIRLINE_NAME"].isin(most_freq_airlines.index)]
avgmins = most_freq_df.groupby(["AIRLINE_NAME","MONTH"])["ARRIVAL_DELAY"].sum()/most_freq_df.groupby("MONTH")["AIRLINE_NAME"].value_counts()

#convert to dataframe
avgmins = avgmins.unstack()

#rank by delay minutes per flight (1 = least delay mins, 6 = most delay mins) 
avgmins_ranked = avgmins.rank(axis=0, ascending=True)
#bump chart
fig = plt.figure(figsize=(18,10))
ax = fig.add_subplot(1,1,1)

#plot transposed dataframe
avgmins_ranked.T.plot(kind="line", ax=ax, marker="o", 
                    markeredgewidth=1, linewidth=6, markersize=44,markerfacecolor="white")

ax.invert_yaxis()

plt.title("Ranking of 6 Most Frequent Carriers by Shortest Average Delay (in Mins)\n"
          "Red = Late || Black = On Time or Early", fontsize=22)
plt.xlabel("Month in 2015", fontsize=22)
plt.ylabel("Monthly Ranking", fontsize=22)

xtickcats = month_order
plt.xticks(range(len(xtickcats)), xtickcats, fontsize=18);
plt.yticks(np.arange(start=1, stop=avgmins_ranked.shape[0]+1, step=1), fontsize=18);

#organize order of legend to reflect Week 1 rankings
handles,labels = ax.get_legend_handles_labels()
handles = [handles[2], handles[4], handles[5], handles[0], handles[1], handles[3]]
labels = [labels[2], labels[4], labels[5], labels[0], labels[1], labels[3]]
ax.legend(handles, labels, bbox_to_anchor=(1.01, 1.01), fontsize=20,
          labelspacing= 1, markerscale=.5)

#function to set color of marker text
def early_late(df,x,y):
    if round(df.iloc[x,y]) <= 0:
        color = "black" #if average flight was on time or early (delay <=0 mins), text is black
    else:
        color = "red" #if average flight is late (delay >= 1 min), text is red
    return color

#fill markers with average delay in minutes
i = 0
j = 0
for eachcol in avgmins_ranked.T.columns:
    for eachrow in avgmins_ranked.T.index:
        this_rank = avgmins_ranked.iloc[j,i]
        ax.text(i, this_rank, round(avgmins.iloc[j,i]), ha="center", va="center", fontsize=18,
               color = early_late(avgmins,j,i)) #color set by calling color fuction defined above
        i+=1
    j+=1
    i=0

#prevent legend from being cut off
plt.tight_layout()
                
plt.show()

Multiple Line Chart

This multiple line chart shows the total delay minutes for each Weekday (Monday through Sunday) of the top 6 most frequent carriers. As in the bump chart, the top 6 most frequent carriers include the “Big 4 airlines” that carry the majority of American passengers (American, Delta, Southwest, United), as well as 2 regional airlines (Skywest and Atlantic Southeast).

The chart indicates that total delay minutes are lowest in the early-middle part of the week (Tuesday and Wednesday) and on weekends, while they are highest Thursdays, Fridays, and Mondays. This likely reflects increased travel at the beginning and end of the workweek, as well as increased flights causing an increase in the average delay per flight.

#dataframe of total delay minutes by top 6 carrier by weekday
total_delay = most_freq_df.groupby(["AIRLINE_NAME","WEEKDAY"])["ARRIVAL_DELAY"].sum().reset_index(name="TOTAL_DELAY")
#line chart
fig = plt.figure(figsize=(18,10))
ax = fig.add_subplot(1,1,1)

for key, grp in total_delay.groupby("AIRLINE_NAME"):
    grp.plot(ax=ax, kind="line", x="WEEKDAY", y="TOTAL_DELAY", label=key, marker="D")

plt.title("Total Delay Minutes by Top 6 Carriers per Day of the Week", fontsize=22)
ax.set_xlabel("Day of the Week", fontsize=20)
ax.set_ylabel("Total Flight Delay Minutes", fontsize=20)

#organize order of legend to reflect Monday rankings
handles,labels = ax.get_legend_handles_labels()
handles = [handles[4], handles[1], handles[3], handles[5], handles[0], handles[2]]
labels = [labels[4], labels[1], labels[3], labels[5], labels[0], labels[2]]
ax.legend(handles, labels, loc="upper right", fontsize=14)

#y axis tick mark range
min_tick = total_delay["TOTAL_DELAY"].min() // 100000 * 100000
max_tick = total_delay["TOTAL_DELAY"].max() // 100000 * 100000 + 200000
ax.set_yticks(np.arange(start=min_tick, stop=max_tick, step=100000))

#y axis tick mark format
ax.yaxis.set_major_formatter(FuncFormatter(lambda x, pos: f"{x:,.0f}"))

#tick mark font size
plt.xticks(fontsize=18);
plt.yticks(fontsize=18);

#increase display size
plt.tight_layout()

plt.show()

Nested Pie Chart

This nested pie chart shows the total minutes of delay caused by different types of delay and the total minutes of delay in each of the four longitudinal regions of the country. The chart only includes delayed flights for which there was a documented cause of delay: 1,063,439 flights (18.28% of the data set). When two or more causes of delay were documented, the delay recorded as causing the highest number of minutes of delay was assigned as the primary cause of delay.

The chart indicates that the cause of delay that caused the most total amount of delay minutes was the aircraft arriving late from its previous flight (41.81% of delay minutes or 22.99 million total delay minutes).

In addition, the East region experienced the highest number of delay minutes (42.81% or 23.54 million minutes).

The total number of delay minutes caused by flights with a documented delay was 54.99 million minutes: that is, 916,500 hours, or around 105 years worth of delays!

#data frame of flights with documented cause of delay

delay_types = delayed[((delayed["AIR_SYSTEM_DELAY"].notna()) & (delayed["AIR_SYSTEM_DELAY"]>0))|
                ((delayed["SECURITY_DELAY"].notna()) & (delayed["SECURITY_DELAY"]>0))|
                 ((delayed["AIRLINE_DELAY"].notna()) & (delayed["AIRLINE_DELAY"]>0))|
                 ((delayed["LATE_AIRCRAFT_DELAY"].notna()) & (delayed["LATE_AIRCRAFT_DELAY"]>0))|
                 ((delayed["WEATHER_DELAY"].notna()) & (delayed["WEATHER_DELAY"]>0))]
                 
#Determine Primary Delay type and Length of Primary Delay

delay_types["max_delay"] = delay_types[["AIR_SYSTEM_DELAY", "SECURITY_DELAY", "AIRLINE_DELAY", "LATE_AIRCRAFT_DELAY", "WEATHER_DELAY"]].max(axis=1)
delay_types["primary_delay_type"] = np.where(delay_types["AIR_SYSTEM_DELAY"] == delay_types["max_delay"],
                                             "Air System",
                                             np.where(delay_types["SECURITY_DELAY"] == delay_types["max_delay"],
                                                      "Security",
                                                      np.where(delay_types["AIRLINE_DELAY"] == delay_types["max_delay"],
                                                               "Airline",
                                                               np.where(delay_types["LATE_AIRCRAFT_DELAY"] == delay_types["max_delay"],
                                                                        "Late Aircraft",
                                                                        "Weather"))))
                                                                        
#data frame for donut chart
pie_df = delay_types.groupby(["REGION", "primary_delay_type"])["max_delay"].sum().reset_index(name="total_mins")

#set up colors
number_outside_colors = len(pie_df.primary_delay_type.unique())
outside_color_ref_number = np.arange(number_outside_colors*4, step=4)

#determine options for color not used on outside
inside_color_options = []
for i in np.arange(0,20):
    if i not in outside_color_ref_number:
        inside_color_options.append(i)

#select evenly spaced colors from options        
indices = np.arange(len(inside_color_options), step=int(len(inside_color_options)/4))
inside_color_ref_number = []
for i in indices:
    inside_color_ref_number.append(inside_color_options[i])
    
#ensure proper number of colors
number_inside_colors = len(pie_df.REGION.unique())
if len(inside_color_ref_number) > number_inside_colors:
    del inside_color_ref_number[number_inside_colors:]
#pie chart
fig = plt.figure(figsize=(18,10))
ax = fig.add_subplot(1,1,1)

#set color maps
colormap = plt.get_cmap("tab20c")
outer_colors = colormap(outside_color_ref_number)
inner_colors = colormap(inside_color_ref_number)

#total of all delay minutes across region and type
allmins = pie_df["total_mins"].sum()

#outer donut
pie_df.groupby("primary_delay_type")["total_mins"].sum().plot(
    kind="pie", radius=1, colors=outer_colors, pctdistance=.8, labeldistance=1.05,
    wedgeprops={"edgecolor":"white"}, textprops={"fontsize":18, "color":"black"},
    autopct= lambda x: "" if x<10 else f"{x:.2f}%\n{x/100*allmins/1e+6:.2f}M mins", #don't add text for delay types less than 10% of pie
    startangle=90)

#inner donut
pie_df.groupby("REGION")["total_mins"].sum().plot(
    kind="pie", radius=.6, colors=inner_colors, pctdistance=.75, labeldistance=.5,
    wedgeprops={"edgecolor":"white"}, textprops={"fontsize":14, "color":"black"},
    autopct = lambda x: f"{x:.2f}%\n{x/100*allmins/1e+6:.2f}M mins" if x >20 else (f"\n\n\n\n{x:.2f}%\n{x/100*allmins/1e+6:.2f}M mins" if x > 17.9 else f"{x:.2f}%\n{x/100*allmins/1e+6:.2f}M mins"), #ensure proper distance of text to label
    startangle=180)

#create hole
hole = plt.Circle((0,0), 0.2, fc="white")
fig1 = plt.gcf()
fig1.gca().add_artist(hole)

#remove y axis
ax.yaxis.set_visible(False)
plt.title("Total Minutes of Delay by Delay Type and By Region\n(in Millions of Minutes)", fontsize=20)

#layout
ax.axis("equal");
plt.tight_layout()

#center text
ax.text(0,0,"Total Delay Mins\n" + str(round(allmins/1e+6,2)) + "M", ha="center", va="center", size=14) 

plt.show()

Waterfall Chart (Plotly)

This waterfall chart examines monthly revenue at Atlanta’s Hartsfield-Jackson International Airport, which was the most frequent origin airport of flights in the data set. The data set did not include information for Atlanta flights in October; therefore, this chart only covers the 11 other months of 2015.

To create this chart, two new variables were assigned:

  1. Each flight was assigned an airport revenue of $0 if the flight was cancelled, $1,000 if the flight departed on time or less than half an hour late, $1,200 if the flight had a moderate delay of betweeen 30 and 60 minutes (with increased revenue for the airport due to increased concession sales from the flight), and $500 if the flight had a long delay of more than an hour (with decreased revenue stemming from cancellations and refunds causing a reduction of airport fees.)

  2. Atlanta airport was a assigned a budgeted monthly revenue of $30,075,000. This approximates one month’s worth of Atlanta airport’s actual annual revenue.

The waterfall chart shows Atlanta airport’s cumulative surplus revenue – that is, its cumulative deviation from the monthly budget for each of the 11 months in the data set. Months show in black mean that the airport exceeded its budgeted monthly revenue of $30,075,000, while months shown in red mean the airport fell short of that month’s budget. Hovering over each bar in the chart shows the total revenue from all flights departing Atlanta airport that month, as well as the cumulative deviation from the monthly budget by that month.

As the waterfall chart shows, despite monthly deficits to start and end the year, Atlanta airport significantly exceeded its budgeted revenue, due to surpluses in March through November. As a result, Atlanta exceeded its monthly revenue budget by a cumulative total of $8,556,000 for a total revenue of $339,381,000.

#revenue per flight earned by origin airport
#This will approximate Atlanta's annual revenue

df["ORIGIN_REVENUE"] = np.where(df["CANCELLED"]==1,
                                0, #if flight is cancelled, $0
                                np.where(df["DEPARTURE_DELAY"] > 60,
                                         500, #if flight has long delay, reduced revenue
                                         np.where(df["DEPARTURE_DELAY"] > 30,
                                                  1200, #if flight has moderate delay, increased revenue from concessions
                                                  1000))) #standard revenue per flight

#data frame of most frequent origin airport (Atlanta)
atlanta = df[df["ORIGIN_AIRPORT_NAME"]==df["ORIGIN_AIRPORT_NAME"].value_counts().head(1).index.tolist()[0]]

#Atlanta has budgeted for a monthly revenue of $30,075,000
budgeted_revenue = 30075000

#data frame of monthly revenue for Atlanta airport
atl_revenue = atlanta.groupby("MONTH")["ORIGIN_REVENUE"].sum().reset_index(name="Revenue")

#remove October row from dataframe, since flight data was not reported for Atlanta in October 2015 in this dataset
atl_revenue = atl_revenue[atl_revenue["MONTH"] != "Oct"]

#budget column, pro-rated for Weeks 1 and 12
atl_revenue["Budget"] = budgeted_revenue

#deviation column
atl_revenue["Deviation"] = atl_revenue["Revenue"] - atl_revenue["Budget"]

#total row
atl_revenue.loc[atl_revenue.index.max()+1] = ["Total",
                                              atl_revenue["Revenue"].sum(),
                                              atl_revenue["Budget"].sum(),
                                              atl_revenue["Deviation"].sum()]
#end color
if atl_revenue.loc[atl_revenue.index.max(),"Deviation"] > 0:
    end_color = "navy"
elif atl_revenue.loc[atl_revenue.index.max(),"Deviation"] < 0:
    end_color = "darkred"
else: 
    end_color = "tan"
    
#waterfall plot
fig = go.Figure(go.Waterfall(name="", orientation = "v", x = atl_revenue["MONTH"], 
                             textposition = "outside", 
                             measure = ["relative", "relative", "relative", "relative",
                                       "relative", "relative", "relative", "relative",
                                       "relative", "relative", "relative", "total"],
                             y = atl_revenue["Deviation"],
                             meta = atl_revenue["Deviation"].cumsum().abs(), #absolute value of deviation to be used in hovertemplate
                             text = [f"${row:,.0f}" for row in atl_revenue["Revenue"]],
                            decreasing = {"marker":{"color":"red"}},
                             increasing = {"marker":{"color":"black"}},
                             totals = {"marker":{"color":end_color}},
                             hovertemplate = np.where(atl_revenue["Deviation"].cumsum() == atl_revenue["Deviation"].cumsum()[atl_revenue["Deviation"].index.max()],
                                                    "Cumulative Surplus: " + "$%{y:,.0f}" + "<br>" + #if total column, display deviation as total measure
                                                    "Total Revenue in %{x}: %{text}",
                                                      np.where(atl_revenue["Deviation"].cumsum() >= 0, #if monthly column, display cumulative deviation as relative measure
                                                            "Cumulative Surplus: " + "$%{meta:,.0f}" + "<br>" + #print "surplus" if cumulative deviation >=0
                                                            "Total Revenue in %{x}: %{text}",
                                                                "Cumulative Deficit: " + "-$%{meta:,.0f}" + "<br>" + #print "deficit" if cumulative deviation <0
                                                                "Total Revenue in %{x}: %{text}"))
                            ))

#format x axis
fig.update_xaxes(type="category", title_text="Month in 2015", title_font={"size":22}, tickfont={"size":14});

#format y axis
fig.update_yaxes(title_text = "Cumulative Surplus Revenue ($M)", title_font={"size":22},
                 dtick=1000000, 
                 tickprefix="$", tickfont={"size":12}, 
                 zeroline=True, zerolinecolor="green", zerolinewidth=1);

#format layout
fig.update_layout(title = dict(text = "Actual vs. Budgeted Monthly Revenue at Atlanta Airport<br>"+
                               "Surpluses appear in Black, Deficits appear in Red",
                               font = dict(family = "Balto", size=22, color="maroon")),
                  template = "plotly_white",
                  title_x = .5,
                  margin = dict(l=30, r=30, t=60, b=30));
                  
fig.show()

Recommendations

The analysis offers several takeaways for passengers, airlines, and for Atlanta airport.

  1. Passengers may wish to consider paying more for premium airlines rather than budget airlines, as the premier airlines tend to have fewer delays and cancellations, as well as shorter delay times on delayed flights. Passengers wishing to limit delays should also aim to travel on weekends, Tuesdays, and Wednesdays, as these days tend to have less total delay time.

  2. Airlines should consider reconfiguring their flights networks, as late arriving flights were the primary cause of delay for delayed departures.

  3. Atlanta airport should consider a dynamic revenue budget with distinct budget goals for each month, reflecting the fact that revenue appears to decrease in the winter and increase in the other seasons.