Flights Data Exploration¶

A real-world dataset containing flights data from the US Department of Transportation will be explored.

Loading and viewing the data.

import pandas as pd
import scipy.stats as stats

# Plots will be displayed inline
%matplotlib inline

from matplotlib import pyplot as plt

# Read dataset
flights = pd.read_csv('data/flights.csv')
flights.head()

# Dimensions
flights.shape

(271940, 20)

The dataset contains observations of US domestic flights in 2013, and consists of the following fields:

Year: The year of the flight (all records are from 2013)
Month: The month of the flight
DayofMonth: The day of the month on which the flight departed
DayOfWeek: The day of the week on which the flight departed - from 1 (Monday) to 7 (Sunday)
Carrier: The two-letter abbreviation for the airline.
OriginAirportID: A unique numeric identifier for the departure aiport
OriginAirportName: The full name of the departure airport
OriginCity: The departure airport city
OriginState: The departure airport state
DestAirportID: A unique numeric identifier for the destination aiport
DestAirportName: The full name of the destination airport
DestCity: The destination airport city
DestState: The destination airport state
CRSDepTime: The scheduled departure time
DepDelay: The number of minutes departure was delayed (flight that left ahead of schedule have a negative value)
DelDelay15: A binary indicator that departure was delayed by more than 15 minutes (and therefore considered "late")
CRSArrTime: The scheduled arrival time
ArrDelay: The number of minutes arrival was delayed (flight that arrived ahead of schedule have a negative value)
ArrDelay15: A binary indicator that arrival was delayed by more than 15 minutes (and therefore considered "late")
Cancelled: A binary indicator that the flight was cancelled

Exploration of the flight data to analyze possible factors that affect delays in departure or arrival of a flight.

Start by cleaning the data.
- Identify any null or missing data, and impute appropriate replacement values.
- Identify and eliminate any outliers in the DepDelay and ArrDelay columns.
Explore the cleaned data.
- View summary statistics for the numeric fields in the dataset.
- Determine the distribution of the DepDelay and ArrDelay columns.
- Use statistics, aggregate functions, and visualizations to answer the following questions:
  - What are the average (mean) departure and arrival delays?
  - How do the carriers compare in terms of arrival delay performance?
  - Is there a noticable difference in arrival delays for different days of the week?
  - Which departure airport has the highest average departure delay?
  - Do late departures tend to result in longer arrival delays than on-time departures?
  - Which route (from origin airport to destination airport) has the most late arrivals?
  - Which route has the highest average arrival delay?

Cleaning dataset¶

Identify any null or missing data, and impute appropriate replacement values.¶

# Checking null and missing data
flights.isnull().sum()

Year                    0
Month                   0
DayofMonth              0
DayOfWeek               0
Carrier                 0
OriginAirportID         0
OriginAirportName       0
OriginCity              0
OriginState             0
DestAirportID           0
DestAirportName         0
DestCity                0
DestState               0
CRSDepTime              0
DepDelay                0
DepDel15             2761
CRSArrTime              0
ArrDelay                0
ArrDel15                0
Cancelled               0
dtype: int64

There are 2761 null values in the variable DepDel15.

# Checking some lines with null values
flights[flights.isnull().any(axis=1)][['DepDelay','DepDel15']]

flights[flights.isnull().any(axis=1)].DepDelay.describe()

count    2761.0
mean        0.0
std         0.0
min         0.0
25%         0.0
50%         0.0
75%         0.0
max         0.0
Name: DepDelay, dtype: float64

NaN values in DepDel15 are from DepDelay with value 0. It is considered that if DepDelay is 0, then the flight was not delayed, so DepDelay15 will be filled with 0.

# Filling NA values in DepDel15 with 0
flights.DepDel15 = flights.DepDel15.fillna(0)

# Checking null and missing data again
flights.isnull().sum()
flights.head()

View summary statistics for the numeric fields in the dataset.¶

flights.describe()

Determine the distribution of the DepDelay and ArrDelay columns.¶

# Create a function showing distribution
def show_distribution(var_data):
    from matplotlib import pyplot as plt

    # Get statistics
    min_val = var_data.min()
    max_val = var_data.max()
    mean_val = var_data.mean()
    med_val = var_data.median()
    mod_val = var_data.mode()[0]

    print('Minimum:{:.2f}\nMean:{:.2f}\nMedian:{:.2f}\nMode:{:.2f}\nMaximum:{:.2f}\n'.format(min_val,
                                                                                            mean_val,
                                                                                            med_val,
                                                                                            mod_val,
                                                                                            max_val))

    # Create a figure for 2 subplots (2 rows, 1 column)
    fig, ax = plt.subplots(2, 1, figsize = (10,4))

    # Plot the histogram   
    ax[0].hist(var_data)
    ax[0].set_ylabel('Frequency')

    # Add lines for the mean, median, and mode
    ax[0].axvline(x=min_val, color = 'gray', linestyle='dashed', linewidth = 2)
    ax[0].axvline(x=mean_val, color = 'cyan', linestyle='dashed', linewidth = 2)
    ax[0].axvline(x=med_val, color = 'red', linestyle='dashed', linewidth = 2)
    ax[0].axvline(x=mod_val, color = 'yellow', linestyle='dashed', linewidth = 2)
    ax[0].axvline(x=max_val, color = 'gray', linestyle='dashed', linewidth = 2)

    # Plot the boxplot   
    ax[1].boxplot(var_data, vert=False)
    ax[1].set_xlabel('Value')

    # Add a title to the Figure
    fig.suptitle('Data Distribution')

    # Show the figure
    fig.show()

    # Get the variable to examine
    col = flights['DepDelay']
    # Call the function
    show_distribution(col)

Minimum:-63.00
Mean:10.35
Median:-1.00
Mode:-3.00
Maximum:1425.00

# Get the variable to examine
col = flights['ArrDelay']
# Call the function
show_distribution(col)

Minimum:-75.00
Mean:6.50
Median:-3.00
Mode:0.00
Maximum:1440.00

Identify and eliminate any outliers in the DepDelay and ArrDelay columns¶

# Removing Outliers - between 0.01 and 90 percentile
q90 = flights.DepDelay.quantile(0.90)
q01 = flights.DepDelay.quantile(0.01)

# Removing Outliers - between 0.01 and 90 percentile
q90a = flights.ArrDelay.quantile(0.90)
q01a = flights.ArrDelay.quantile(0.01)

# Filtered flights
flights = flights[(flights.DepDelay>q01) & (flights.DepDelay<q90)]
flights = flights[(flights.ArrDelay>q01a) & (flights.ArrDelay<q90a)]
flights.describe()

delayFields = ['DepDelay','ArrDelay']

# View the revised distributions
for col in delayFields:
    show_distribution(flights[col])

Minimum:-11.00
Mean:1.29
Median:-1.00
Mode:-3.00
Maximum:36.00

Minimum:-32.00
Mean:-3.10
Median:-5.00
Mode:0.00
Maximum:37.00

# Number of Flights by Day of Week
print(flights.groupby(flights.DayOfWeek).Year.count())

DayOfWeek
1    35330
2    35096
3    35344
4    34124
5    34033
6    28070
7    32805
Name: Year, dtype: int64

There are less flights on Saturdays. On Sat the number of flights is smaller.

# Cancelled flights
pass_counts = flights['Cancelled'].value_counts()
plt.pie(pass_counts, labels=pass_counts)
plt.legend(pass_counts.keys().tolist())

<matplotlib.legend.Legend at 0x7fabe0273278>

# Mean Dep Delay and Arr Delay by Day of Week
print(flights.groupby(flights.DayOfWeek)[delayFields].mean())
for col in delayFields:
    flights.boxplot(column=col, by='DayOfWeek', figsize=(8,8))

           DepDelay  ArrDelay
DayOfWeek                    
1          1.352109 -3.004444
2          0.772424 -3.838871
3          1.064679 -2.852733
4          2.021480 -1.744286
5          1.847148 -2.315723
6          0.766227 -4.609975
7          1.117543 -3.636915

How do the carriers compare in terms of arrival delay performance?¶

print(flights.groupby(flights.Carrier)[delayFields].mean())
for col in delayFields:
    flights.boxplot(column=col, by='Carrier', figsize=(8,8))

         DepDelay  ArrDelay
Carrier                    
9E      -1.183773 -6.071124
AA       1.441176 -3.882760
AS      -2.085376 -2.779544
B6      -0.049172 -3.906673
DL      -0.060244 -5.305459
EV      -0.115334 -4.862082
F9       1.636004  1.312562
FL       1.898731 -1.998780
HA      -2.331828 -1.753950
MQ       0.835789 -1.967846
OO      -0.787541 -2.750888
UA       1.923266 -5.740674
US      -1.167277 -2.836717
VX       0.817295 -3.971063
WN       5.038805  0.045423
YV      -0.823834 -2.469025

Which departure airport has the highest average departure delay?¶

departure_airport_group = flights.groupby(flights.OriginAirportName)

mean_departure_delays = pd.DataFrame(departure_airport_group['DepDelay'].mean()).sort_values('DepDelay', ascending=False)
mean_departure_delays.plot(kind = "bar", figsize=(12,12))
mean_departure_delays

Do late departures tend to result in longer arrival delays than on-time departures?¶

flights.boxplot(column='ArrDelay', by='DepDel15', figsize=(12,12))

<matplotlib.axes._subplots.AxesSubplot at 0x7fabdbaf3588>

Which route (from origin airport to destination airport) has the most late arrivals?¶

# Add a routes column
routes  = pd.Series(flights['OriginAirportName'] + ' -> ' + flights['DestAirportName'])
flights = pd.concat([flights, routes.rename("route")], axis=1)

# Sum arr delays per route
routearrdel = flights.groupby(flights.route)
pd.DataFrame(routearrdel['ArrDel15'].sum()).sort_values('ArrDel15', ascending=False)

Which route has the highest average arrival delay?¶

# get the mean value and order from higher to minor
pd.DataFrame(routearrdel['ArrDelay'].mean()).sort_values('ArrDelay', ascending=False)

# Checking DepDelay of cancelled flights
flights.boxplot(column='DepDelay', by='Cancelled', figsize=(8,5))

<matplotlib.axes._subplots.AxesSubplot at 0x7fabdba6f438>

	Year	Month	DayofMonth	DayOfWeek	OriginAirportID	DestAirportID	CRSDepTime	DepDelay	DepDel15	CRSArrTime	ArrDelay	ArrDel15	Cancelled
count	271940.0	271940.000000	271940.000000	271940.000000	271940.000000	271940.000000	271940.000000	271940.000000	271940.000000	271940.000000	271940.000000	271940.000000	271940.000000
mean	2013.0	6.978514	15.814775	3.899316	12744.899584	12739.136493	1325.498713	10.350449	0.199092	1505.477819	6.496650	0.215930	0.010723
std	0.0	1.982431	8.799033	1.987219	1501.199773	1502.569949	470.748546	35.673710	0.399318	493.204020	38.230331	0.411467	0.102995
min	2013.0	4.000000	1.000000	1.000000	10140.000000	10140.000000	1.000000	-63.000000	0.000000	1.000000	-75.000000	0.000000	0.000000
25%	2013.0	5.000000	8.000000	2.000000	11292.000000	11292.000000	920.000000	-4.000000	0.000000	1119.000000	-11.000000	0.000000	0.000000
50%	2013.0	7.000000	16.000000	4.000000	12892.000000	12892.000000	1320.000000	-1.000000	0.000000	1529.000000	-3.000000	0.000000	0.000000
75%	2013.0	9.000000	23.000000	6.000000	14057.000000	14057.000000	1725.000000	9.000000	0.000000	1918.000000	10.000000	0.000000	0.000000
max	2013.0	10.000000	31.000000	7.000000	15376.000000	15376.000000	2359.000000	1425.000000	1.000000	2359.000000	1440.000000	1.000000	1.000000

	DepDelay
OriginAirportName
Chicago Midway International	6.240506
Dallas Love Field	4.947875
William P Hobby	4.945270
Baltimore/Washington International Thurgood Marshall	3.805278
Denver International	3.681906
...	...
Tucson International	-1.372188
Richmond International	-1.609312
Kahului Airport	-1.852998
Ted Stevens Anchorage International	-1.955453
Long Beach Airport	-2.269333

	ArrDel15
route
San Francisco International -> Los Angeles International	146
Los Angeles International -> San Francisco International	123
McCarran International -> Los Angeles International	90
LaGuardia -> Hartsfield-Jackson Atlanta International	88
Los Angeles International -> McCarran International	80
...	...
Luis Munoz Marin International -> Ronald Reagan Washington National	0
Luis Munoz Marin International -> Jacksonville International	0
Luis Munoz Marin International -> Chicago O'Hare International	0
Tampa International -> Cleveland-Hopkins International	0
Louis Armstrong New Orleans International -> San Francisco International	0

	ArrDelay
route
Louis Armstrong New Orleans International -> Ronald Reagan Washington National	25.333333
Cleveland-Hopkins International -> Palm Beach International	18.000000
Denver International -> Raleigh-Durham International	14.153846
Charlotte Douglas International -> General Mitchell International	13.500000
Cleveland-Hopkins International -> Philadelphia International	12.800000
...	...
Lambert-St. Louis International -> Cleveland-Hopkins International	-20.000000
Jacksonville International -> Chicago Midway International	-20.000000
Eppley Airfield -> LaGuardia	-20.750000
Denver International -> Kahului Airport	-22.666667
Indianapolis International -> Logan International	-26.000000

	DepDelay	DepDel15
171	0	NaN
359	0	NaN
429	0	NaN
545	0	NaN
554	0	NaN
...	...	...
271410	0	NaN
271607	0	NaN
271634	0	NaN
271671	0	NaN
271885	0	NaN

	Year	Month	DayofMonth	DayOfWeek	Carrier	OriginAirportID	OriginAirportName	OriginCity	OriginState	DestAirportID	DestAirportName	DestCity	DestState	CRSDepTime	DepDelay	DepDel15	CRSArrTime	ArrDelay	ArrDel15
0	2013	9	16	1	DL	15304	Tampa International	Tampa	FL	12478	John F. Kennedy International	New York	NY	1539	4	0.0	1824	13	0
1	2013	9	23	1	WN	14122	Pittsburgh International	Pittsburgh	PA	13232	Chicago Midway International	Chicago	IL	710	3	0.0	740	22	1
2	2013	9	7	6	AS	14747	Seattle/Tacoma International	Seattle	WA	11278	Ronald Reagan Washington National	Washington	DC	810	-3	0.0	1614	-7	0
3	2013	7	22	1	OO	13930	Chicago O'Hare International	Chicago	IL	11042	Cleveland-Hopkins International	Cleveland	OH	804	35	1.0	1027	33	1
4	2013	5	16	4	DL	13931	Norfolk International	Norfolk	VA	10397	Hartsfield-Jackson Atlanta International	Atlanta	GA	545	-1	0.0	728	-9	0