Flights Data Exploration

A real-world dataset containing flights data from the US Department of Transportation will be explored.

Loading and viewing the data.

In [1]:
import pandas as pd
import scipy.stats as stats

# Plots will be displayed inline
%matplotlib inline

from matplotlib import pyplot as plt
In [2]:
# Read dataset
flights = pd.read_csv('data/flights.csv')
flights.head()

# Dimensions
flights.shape
Out[2]:
(271940, 20)

The dataset contains observations of US domestic flights in 2013, and consists of the following fields:

  • Year: The year of the flight (all records are from 2013)
  • Month: The month of the flight
  • DayofMonth: The day of the month on which the flight departed
  • DayOfWeek: The day of the week on which the flight departed - from 1 (Monday) to 7 (Sunday)
  • Carrier: The two-letter abbreviation for the airline.
  • OriginAirportID: A unique numeric identifier for the departure aiport
  • OriginAirportName: The full name of the departure airport
  • OriginCity: The departure airport city
  • OriginState: The departure airport state
  • DestAirportID: A unique numeric identifier for the destination aiport
  • DestAirportName: The full name of the destination airport
  • DestCity: The destination airport city
  • DestState: The destination airport state
  • CRSDepTime: The scheduled departure time
  • DepDelay: The number of minutes departure was delayed (flight that left ahead of schedule have a negative value)
  • DelDelay15: A binary indicator that departure was delayed by more than 15 minutes (and therefore considered "late")
  • CRSArrTime: The scheduled arrival time
  • ArrDelay: The number of minutes arrival was delayed (flight that arrived ahead of schedule have a negative value)
  • ArrDelay15: A binary indicator that arrival was delayed by more than 15 minutes (and therefore considered "late")
  • Cancelled: A binary indicator that the flight was cancelled

Exploration of the flight data to analyze possible factors that affect delays in departure or arrival of a flight.

  1. Start by cleaning the data.
    • Identify any null or missing data, and impute appropriate replacement values.
    • Identify and eliminate any outliers in the DepDelay and ArrDelay columns.
  2. Explore the cleaned data.
    • View summary statistics for the numeric fields in the dataset.
    • Determine the distribution of the DepDelay and ArrDelay columns.
    • Use statistics, aggregate functions, and visualizations to answer the following questions:
      • What are the average (mean) departure and arrival delays?
      • How do the carriers compare in terms of arrival delay performance?
      • Is there a noticable difference in arrival delays for different days of the week?
      • Which departure airport has the highest average departure delay?
      • Do late departures tend to result in longer arrival delays than on-time departures?
      • Which route (from origin airport to destination airport) has the most late arrivals?
      • Which route has the highest average arrival delay?

Cleaning dataset

Identify any null or missing data, and impute appropriate replacement values.

In [3]:
# Checking null and missing data
flights.isnull().sum()
Out[3]:
Year                    0
Month                   0
DayofMonth              0
DayOfWeek               0
Carrier                 0
OriginAirportID         0
OriginAirportName       0
OriginCity              0
OriginState             0
DestAirportID           0
DestAirportName         0
DestCity                0
DestState               0
CRSDepTime              0
DepDelay                0
DepDel15             2761
CRSArrTime              0
ArrDelay                0
ArrDel15                0
Cancelled               0
dtype: int64

There are 2761 null values in the variable DepDel15.

In [4]:
# Checking some lines with null values
flights[flights.isnull().any(axis=1)][['DepDelay','DepDel15']]
Out[4]:
DepDelay DepDel15
171 0 NaN
359 0 NaN
429 0 NaN
545 0 NaN
554 0 NaN
... ... ...
271410 0 NaN
271607 0 NaN
271634 0 NaN
271671 0 NaN
271885 0 NaN

2761 rows × 2 columns

In [5]:
flights[flights.isnull().any(axis=1)].DepDelay.describe()
Out[5]:
count    2761.0
mean        0.0
std         0.0
min         0.0
25%         0.0
50%         0.0
75%         0.0
max         0.0
Name: DepDelay, dtype: float64

NaN values in DepDel15 are from DepDelay with value 0. It is considered that if DepDelay is 0, then the flight was not delayed, so DepDelay15 will be filled with 0.

In [6]:
# Filling NA values in DepDel15 with 0
flights.DepDel15 = flights.DepDel15.fillna(0)

# Checking null and missing data again
flights.isnull().sum()
flights.head()
Out[6]:
Year Month DayofMonth DayOfWeek Carrier OriginAirportID OriginAirportName OriginCity OriginState DestAirportID DestAirportName DestCity DestState CRSDepTime DepDelay DepDel15 CRSArrTime ArrDelay ArrDel15 Cancelled
0 2013 9 16 1 DL 15304 Tampa International Tampa FL 12478 John F. Kennedy International New York NY 1539 4 0.0 1824 13 0 0
1 2013 9 23 1 WN 14122 Pittsburgh International Pittsburgh PA 13232 Chicago Midway International Chicago IL 710 3 0.0 740 22 1 0
2 2013 9 7 6 AS 14747 Seattle/Tacoma International Seattle WA 11278 Ronald Reagan Washington National Washington DC 810 -3 0.0 1614 -7 0 0
3 2013 7 22 1 OO 13930 Chicago O'Hare International Chicago IL 11042 Cleveland-Hopkins International Cleveland OH 804 35 1.0 1027 33 1 0
4 2013 5 16 4 DL 13931 Norfolk International Norfolk VA 10397 Hartsfield-Jackson Atlanta International Atlanta GA 545 -1 0.0 728 -9 0 0

View summary statistics for the numeric fields in the dataset.

In [7]:
flights.describe()
Out[7]:
Year Month DayofMonth DayOfWeek OriginAirportID DestAirportID CRSDepTime DepDelay DepDel15 CRSArrTime ArrDelay ArrDel15 Cancelled
count 271940.0 271940.000000 271940.000000 271940.000000 271940.000000 271940.000000 271940.000000 271940.000000 271940.000000 271940.000000 271940.000000 271940.000000 271940.000000
mean 2013.0 6.978514 15.814775 3.899316 12744.899584 12739.136493 1325.498713 10.350449 0.199092 1505.477819 6.496650 0.215930 0.010723
std 0.0 1.982431 8.799033 1.987219 1501.199773 1502.569949 470.748546 35.673710 0.399318 493.204020 38.230331 0.411467 0.102995
min 2013.0 4.000000 1.000000 1.000000 10140.000000 10140.000000 1.000000 -63.000000 0.000000 1.000000 -75.000000 0.000000 0.000000
25% 2013.0 5.000000 8.000000 2.000000 11292.000000 11292.000000 920.000000 -4.000000 0.000000 1119.000000 -11.000000 0.000000 0.000000
50% 2013.0 7.000000 16.000000 4.000000 12892.000000 12892.000000 1320.000000 -1.000000 0.000000 1529.000000 -3.000000 0.000000 0.000000
75% 2013.0 9.000000 23.000000 6.000000 14057.000000 14057.000000 1725.000000 9.000000 0.000000 1918.000000 10.000000 0.000000 0.000000
max 2013.0 10.000000 31.000000 7.000000 15376.000000 15376.000000 2359.000000 1425.000000 1.000000 2359.000000 1440.000000 1.000000 1.000000

Determine the distribution of the DepDelay and ArrDelay columns.

In [11]:
# Create a function showing distribution
def show_distribution(var_data):
    from matplotlib import pyplot as plt

    # Get statistics
    min_val = var_data.min()
    max_val = var_data.max()
    mean_val = var_data.mean()
    med_val = var_data.median()
    mod_val = var_data.mode()[0]

    print('Minimum:{:.2f}\nMean:{:.2f}\nMedian:{:.2f}\nMode:{:.2f}\nMaximum:{:.2f}\n'.format(min_val,
                                                                                            mean_val,
                                                                                            med_val,
                                                                                            mod_val,
                                                                                            max_val))

    # Create a figure for 2 subplots (2 rows, 1 column)
    fig, ax = plt.subplots(2, 1, figsize = (10,4))

    # Plot the histogram   
    ax[0].hist(var_data)
    ax[0].set_ylabel('Frequency')

    # Add lines for the mean, median, and mode
    ax[0].axvline(x=min_val, color = 'gray', linestyle='dashed', linewidth = 2)
    ax[0].axvline(x=mean_val, color = 'cyan', linestyle='dashed', linewidth = 2)
    ax[0].axvline(x=med_val, color = 'red', linestyle='dashed', linewidth = 2)
    ax[0].axvline(x=mod_val, color = 'yellow', linestyle='dashed', linewidth = 2)
    ax[0].axvline(x=max_val, color = 'gray', linestyle='dashed', linewidth = 2)

    # Plot the boxplot   
    ax[1].boxplot(var_data, vert=False)
    ax[1].set_xlabel('Value')

    # Add a title to the Figure
    fig.suptitle('Data Distribution')

    # Show the figure
    fig.show()
In [12]:
    # Get the variable to examine
    col = flights['DepDelay']
    # Call the function
    show_distribution(col)
Minimum:-63.00
Mean:10.35
Median:-1.00
Mode:-3.00
Maximum:1425.00

In [13]:
# Get the variable to examine
col = flights['ArrDelay']
# Call the function
show_distribution(col)
Minimum:-75.00
Mean:6.50
Median:-3.00
Mode:0.00
Maximum:1440.00

Identify and eliminate any outliers in the DepDelay and ArrDelay columns

In [14]:
# Removing Outliers - between 0.01 and 90 percentile
q90 = flights.DepDelay.quantile(0.90)
q01 = flights.DepDelay.quantile(0.01)

# Removing Outliers - between 0.01 and 90 percentile
q90a = flights.ArrDelay.quantile(0.90)
q01a = flights.ArrDelay.quantile(0.01)

# Filtered flights
flights = flights[(flights.DepDelay>q01) & (flights.DepDelay<q90)]
flights = flights[(flights.ArrDelay>q01a) & (flights.ArrDelay<q90a)]
flights.describe()

delayFields = ['DepDelay','ArrDelay']

# View the revised distributions
for col in delayFields:
    show_distribution(flights[col])
                          
Minimum:-11.00
Mean:1.29
Median:-1.00
Mode:-3.00
Maximum:36.00

Minimum:-32.00
Mean:-3.10
Median:-5.00
Mode:0.00
Maximum:37.00

In [15]:
# Number of Flights by Day of Week
print(flights.groupby(flights.DayOfWeek).Year.count())
DayOfWeek
1    35330
2    35096
3    35344
4    34124
5    34033
6    28070
7    32805
Name: Year, dtype: int64

There are less flights on Saturdays. On Sat the number of flights is smaller.

In [16]:
# Cancelled flights
pass_counts = flights['Cancelled'].value_counts()
plt.pie(pass_counts, labels=pass_counts)
plt.legend(pass_counts.keys().tolist())
Out[16]:
<matplotlib.legend.Legend at 0x7fabe0273278>
In [17]:
# Mean Dep Delay and Arr Delay by Day of Week
print(flights.groupby(flights.DayOfWeek)[delayFields].mean())
for col in delayFields:
    flights.boxplot(column=col, by='DayOfWeek', figsize=(8,8))
           DepDelay  ArrDelay
DayOfWeek                    
1          1.352109 -3.004444
2          0.772424 -3.838871
3          1.064679 -2.852733
4          2.021480 -1.744286
5          1.847148 -2.315723
6          0.766227 -4.609975
7          1.117543 -3.636915

How do the carriers compare in terms of arrival delay performance?

In [18]:
print(flights.groupby(flights.Carrier)[delayFields].mean())
for col in delayFields:
    flights.boxplot(column=col, by='Carrier', figsize=(8,8))
         DepDelay  ArrDelay
Carrier                    
9E      -1.183773 -6.071124
AA       1.441176 -3.882760
AS      -2.085376 -2.779544
B6      -0.049172 -3.906673
DL      -0.060244 -5.305459
EV      -0.115334 -4.862082
F9       1.636004  1.312562
FL       1.898731 -1.998780
HA      -2.331828 -1.753950
MQ       0.835789 -1.967846
OO      -0.787541 -2.750888
UA       1.923266 -5.740674
US      -1.167277 -2.836717
VX       0.817295 -3.971063
WN       5.038805  0.045423
YV      -0.823834 -2.469025

Which departure airport has the highest average departure delay?

In [19]:
departure_airport_group = flights.groupby(flights.OriginAirportName)

mean_departure_delays = pd.DataFrame(departure_airport_group['DepDelay'].mean()).sort_values('DepDelay', ascending=False)
mean_departure_delays.plot(kind = "bar", figsize=(12,12))
mean_departure_delays
Out[19]:
DepDelay
OriginAirportName
Chicago Midway International 6.240506
Dallas Love Field 4.947875
William P Hobby 4.945270
Baltimore/Washington International Thurgood Marshall 3.805278
Denver International 3.681906
... ...
Tucson International -1.372188
Richmond International -1.609312
Kahului Airport -1.852998
Ted Stevens Anchorage International -1.955453
Long Beach Airport -2.269333

70 rows × 1 columns

Do late departures tend to result in longer arrival delays than on-time departures?

In [20]:
flights.boxplot(column='ArrDelay', by='DepDel15', figsize=(12,12))
Out[20]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fabdbaf3588>

Which route (from origin airport to destination airport) has the most late arrivals?

In [24]:
# Add a routes column
routes  = pd.Series(flights['OriginAirportName'] + ' -> ' + flights['DestAirportName'])
flights = pd.concat([flights, routes.rename("route")], axis=1)

# Sum arr delays per route
routearrdel = flights.groupby(flights.route)
pd.DataFrame(routearrdel['ArrDel15'].sum()).sort_values('ArrDel15', ascending=False)
Out[24]:
ArrDel15
route
San Francisco International -> Los Angeles International 146
Los Angeles International -> San Francisco International 123
McCarran International -> Los Angeles International 90
LaGuardia -> Hartsfield-Jackson Atlanta International 88
Los Angeles International -> McCarran International 80
... ...
Luis Munoz Marin International -> Ronald Reagan Washington National 0
Luis Munoz Marin International -> Jacksonville International 0
Luis Munoz Marin International -> Chicago O'Hare International 0
Tampa International -> Cleveland-Hopkins International 0
Louis Armstrong New Orleans International -> San Francisco International 0

2481 rows × 1 columns

Which route has the highest average arrival delay?

In [25]:
# get the mean value and order from higher to minor
pd.DataFrame(routearrdel['ArrDelay'].mean()).sort_values('ArrDelay', ascending=False)
Out[25]:
ArrDelay
route
Louis Armstrong New Orleans International -> Ronald Reagan Washington National 25.333333
Cleveland-Hopkins International -> Palm Beach International 18.000000
Denver International -> Raleigh-Durham International 14.153846
Charlotte Douglas International -> General Mitchell International 13.500000
Cleveland-Hopkins International -> Philadelphia International 12.800000
... ...
Lambert-St. Louis International -> Cleveland-Hopkins International -20.000000
Jacksonville International -> Chicago Midway International -20.000000
Eppley Airfield -> LaGuardia -20.750000
Denver International -> Kahului Airport -22.666667
Indianapolis International -> Logan International -26.000000

2481 rows × 1 columns

In [23]:
# Checking DepDelay of cancelled flights
flights.boxplot(column='DepDelay', by='Cancelled', figsize=(8,5))
Out[23]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fabdba6f438>