This report presents a visual analysis of flight delays across the United States in 2015. Using a dataset of over 5 million flights, processed in Python, we aim to identify systemic bottlenecks that reduce the aviation industry’s efficiency. Although the data is a decade old, it offers a detailed baseline for comparison with today’s delays, which are attributed to prolonged government shutdowns and record-breaking airport traffic.
The 2015 monthly trend shows the number of delayed departures rising with travel volume, with prominent spikes during the summer vacation months (June and July) and the December holiday season. These peaks suggest that airport infrastructure is under the greatest strain when consumer demand is highest. Conversely, delays reach their lowest points in September and October, once the summer rush has ended and students have returned to school. Note that the chart counts flights with a positive departure delay, so it reflects the volume of delays rather than the share of flights delayed.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
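# Load the full 2015 flight table; low_memory=False avoids the mixed-dtype warning
df_full = pd.read_csv('flight delays.csv', low_memory=False)
# Keep only flights that departed late (positive departure delay)
df_delayed = df_full[df_full['DEPARTURE_DELAY'] > 0].copy()
# Count delayed departures per calendar month
monthly_counts = df_delayed.groupby('MONTH')['DEPARTURE_DELAY'].count()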
plt.figure(figsize=(12, 6))
plt.plot(monthly_counts.index, monthly_counts.values, marker='o', color='#1f77b4', linewidth=2.5, markersize=8)
plt.title('Monthly Trend of Flight Delays (2015)', fontsize=15, fontweight='bold')
plt.xlabel('Month (1=Jan, 12=Dec)')
plt.ylabel('Total Number of Delayed Flights')
plt.xticks(range(1, 13))
plt.grid(True, linestyle='--', alpha=0.6)
plt.show()
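The monthly chart above counts delayed flights only, so it cannot by itself show how closely delays track travel volume. The following is a minimal sketch of one way to check that relationship, assuming the same `MONTH` and `DEPARTURE_DELAY` columns; the helper name `monthly_volume_vs_delays` and the sample rows are hypothetical and exist only to illustrate the shape of the result.

```python
import pandas as pd

def monthly_volume_vs_delays(df: pd.DataFrame) -> pd.DataFrame:
    """Per-month flight volume, delayed-flight count, and delayed share."""
    # Missing delays (NaN) compare as False, so they count as "not delayed"
    is_delayed = df['DEPARTURE_DELAY'] > 0
    out = pd.DataFrame({
        'flights': df.groupby('MONTH').size(),
        'delayed': is_delayed.groupby(df['MONTH']).sum(),
    })
    out['delayed_share'] = out['delayed'] / out['flights']
    return out

# Made-up rows, not taken from the 2015 dataset
sample = pd.DataFrame({
    'MONTH': [1, 1, 1, 1, 2, 2, 3, 3, 3],
    'DEPARTURE_DELAY': [5, -2, 0, 12, -1, 30, 4, 8, -3],
})
summary = monthly_volume_vs_delays(sample)
# Pearson correlation between monthly volume and monthly delayed count
corr = summary['flights'].corr(summary['delayed'])
print(summary)
print(f"volume/delay correlation: {corr:.2f}")
```

On the real data this would be called as `monthly_volume_vs_delays(df_full)`. With only twelve monthly points, the correlation is a rough indicator rather than a statistical test.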
The weekly distribution of delays follows the typical cycle of travel. The highest average delay length occurs on Mondays and Fridays, when the start and end of the professional work week create a surge in airport congestion. Saturdays show the lowest average delay, which is somewhat surprising for a weekend day. Because the average is taken over delayed flights only, the chart describes how long delays last rather than how often they occur. It suggests that the day of the week is a useful indicator of operational reliability, with mid-week flights offering a more stable travel window.
df_full = pd.read_csv('flight delays.csv', low_memory=False)
df_delayed = df_full[df_full['DEPARTURE_DELAY'] > 0]
day_stats = df_delayed.groupby('DAY_OF_WEEK')['DEPARTURE_DELAY'].mean().sort_index()
day_map = {1: 'Mon', 2: 'Tue', 3: 'Wed', 4: 'Thu', 5: 'Fri', 6: 'Sat', 7: 'Sun'}
labels = [day_map[i] for i in day_stats.index]
plt.figure(figsize=(10, 6))
plt.bar(labels, day_stats.values, color='firebrick', edgecolor='black')
plt.title('Average Minutes of Flight Delays by Day of the Week', fontsize=14)
plt.xlabel('Day of the Week')
plt.ylabel('Average Delay (Minutes)')
plt.grid(axis='y', linestyle='--', alpha=0.5)
plt.show()
The heatmap shows that the number of delayed flights is close to zero in the early morning and builds steadily as the day goes on. The worst times to fly are between 5 PM and 8 PM, when the map turns dark blue on almost every day of the week. A likely explanation is delay propagation: a flight that leaves late early in the day can push back every later flight that depends on the same aircraft or crew, so the backlog accumulates by evening. For travelers who want to leave on time, the data suggests that flights scheduled before 8:00 AM are the most consistently safe bet.
df_full = pd.read_csv('flight delays.csv', low_memory=False)
df_delayed = df_full[df_full['DEPARTURE_DELAY'] > 0].copy()
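# Assumes SCHEDULED_DEPARTURE is stored as HHMM, so integer division by 100 yields the hour
df_delayed['HOUR'] = df_delayed['SCHEDULED_DEPARTURE'] // 100
# Count delayed departures for every (day of week, hour) cell
grid_counts = df_delayed.pivot_table(index='DAY_OF_WEEK',
                                     columns='HOUR',
                                     values='DEPARTURE_DELAY',
                                     aggfunc='count')
# Map day codes to labels by value rather than by position (1=Mon ... 7=Sun)
grid_counts = grid_counts.rename(index={1: 'Mon', 2: 'Tue', 3: 'Wed', 4: 'Thu',
                                        5: 'Fri', 6: 'Sat', 7: 'Sun'})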
plt.figure(figsize=(15, 8))
sns.heatmap(grid_counts, annot=False, cmap='Blues', cbar_kws={'label': 'Number of Delayed Flights'})
plt.title('Heatmap: Volume (Count) of Delayed Flights', fontsize=16, fontweight='bold')
plt.xlabel('Hour of the Day')
plt.ylabel('Day of the Week')
plt.show()
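The snowball effect described for the heatmap can be illustrated with a toy simulation of one aircraft flying several legs in a day. This is a sketch under stated assumptions, not a model fitted to the 2015 data: the function `propagate_delays`, the schedule, the 90-minute legs, and the 40-minute minimum turnaround are all hypothetical.

```python
def propagate_delays(scheduled_departures, block_minutes, min_turnaround, initial_delay):
    """Simulate one aircraft flying consecutive legs.

    scheduled_departures: scheduled departure times, in minutes after midnight
    block_minutes: gate-to-gate duration of each leg, in minutes
    min_turnaround: minimum ground time needed between legs, in minutes
    initial_delay: delay applied to the first departure, in minutes
    Returns the departure delay of every leg, in minutes.
    """
    delays = []
    earliest_ready = scheduled_departures[0] + initial_delay
    for sched, block in zip(scheduled_departures, block_minutes):
        # A leg cannot leave before its schedule or before the aircraft is ready
        actual = max(sched, earliest_ready)
        delays.append(actual - sched)
        earliest_ready = actual + block + min_turnaround
    return delays

# Hypothetical rotation: departures every 135 minutes starting at 06:00,
# 90-minute legs, so 45 minutes are scheduled on the ground between legs
schedule = [6 * 60, 8 * 60 + 15, 10 * 60 + 30, 12 * 60 + 45, 15 * 60]
blocks = [90] * 5

late_start = propagate_delays(schedule, blocks, min_turnaround=40, initial_delay=60)
on_time = propagate_delays(schedule, blocks, min_turnaround=40, initial_delay=0)
print(late_start)  # [60, 55, 50, 45, 40]
print(on_time)     # [0, 0, 0, 0, 0]
```

With only five minutes of slack per turnaround, a one-hour delay on the first leg leaves every later leg late as well. Each of those legs would be counted as a delayed flight in the heatmap, which is one way the evening cells can fill up.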
The bar chart shows that a large share of all 2015 delays is concentrated at just a few major airports. ATL (Atlanta), ORD (Chicago), and DFW (Dallas) lead the list, which is consistent with their role as hubs handling very high volumes of domestic, international, and connecting flights. This concentration means that a single storm or technical glitch at one of these hubs can disrupt schedules across the rest of the national network. Because the chart counts delayed departures rather than the share of departures delayed, it shows where delays are most numerous; it does not by itself show that an individual flight from these airports is more likely to be late.
df_full = pd.read_csv('flight delays.csv', low_memory=False)
df_delayed = df_full[df_full['DEPARTURE_DELAY'] > 0]
top_10_airports = df_delayed['ORIGIN_AIRPORT'].value_counts().head(10)
plt.figure(figsize=(12, 6))
top_10_airports.plot(kind='barh', color='orange', edgecolor='black')
plt.title('Top 10 Airports with the Most Frequent Delays (2015)', fontsize=14)
plt.xlabel('Total Number of Delayed Flights')
plt.ylabel('Airport Code')
plt.gca().invert_yaxis()
plt.grid(axis='x', linestyle='--', alpha=0.5)
plt.show()
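Counting delayed departures favors the busiest airports, so a per-airport delay rate is a useful complement to the ranking above. The sketch below assumes the same `ORIGIN_AIRPORT` and `DEPARTURE_DELAY` columns; the helper `airport_delay_rates`, the `min_flights` threshold, and the sample rows (including the made-up code `XYZ`) are illustrative assumptions.

```python
import pandas as pd

def airport_delay_rates(df: pd.DataFrame, min_flights: int = 1000) -> pd.DataFrame:
    """Share of departures delayed per origin airport, busiest airports only."""
    flagged = pd.DataFrame({
        # Cast to str in case the column mixes numeric and text codes
        'airport': df['ORIGIN_AIRPORT'].astype(str),
        'is_delayed': df['DEPARTURE_DELAY'] > 0,
    })
    stats = flagged.groupby('airport')['is_delayed'].agg(['count', 'sum'])
    stats.columns = ['flights', 'delayed']
    stats['delay_rate'] = stats['delayed'] / stats['flights']
    # Drop small airports, where a handful of flights can produce extreme rates
    busy = stats[stats['flights'] >= min_flights]
    return busy.sort_values('delay_rate', ascending=False)

# Made-up rows, not taken from the 2015 dataset
sample = pd.DataFrame({
    'ORIGIN_AIRPORT': ['ATL'] * 4 + ['ORD'] * 3 + ['XYZ'] * 2,
    'DEPARTURE_DELAY': [10, -3, 25, 0, 7, -5, -1, 15, 40],
})
rates = airport_delay_rates(sample, min_flights=3)
print(rates)
```

On the full table, `airport_delay_rates(df_full)` would rank airports by how often a departure is late rather than by how many late departures they produce, and the two rankings need not agree.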
The treemap shows that a large portion of all delays is concentrated among a few major airlines, with Southwest (WN) and Delta (DL) occupying the largest blocks. These carriers operate the most daily flights, so they would be expected to accumulate more total delays than smaller airlines. Southwest, Delta, and United together account for a large share of all delayed departures in 2015, but the treemap shows raw counts; whether that share is disproportionate depends on how it compares with each carrier’s share of all flights. The visualization suggests that the choice of airline, like the choice of airport, is associated with delay exposure.
import squarify
import matplotlib.pyplot as plt
import seaborn as sns
df_full = pd.read_csv('flight delays.csv', low_memory=False)
df_delayed = df_full[df_full['DEPARTURE_DELAY'] > 0].copy()
airline_map = {
    'WN': 'Southwest', 'DL': 'Delta', 'UA': 'United', 'AA': 'American',
    'OO': 'SkyWest', 'EV': 'Atlantic SE', 'B6': 'JetBlue',
    'MQ': 'Envoy Air', 'US': 'US Airways', 'NK': 'Spirit'
}
top_10 = df_delayed['AIRLINE'].value_counts().head(10)
labels = [f"{airline_map.get(code, code)}\n({count:,})"
          for code, count in zip(top_10.index, top_10.values)]
plt.figure(figsize=(14, 8))
squarify.plot(sizes=top_10.values, label=labels,
              color=sns.color_palette("Spectral", 10),
              alpha=0.8, edgecolor="white", linewidth=2)
plt.title('Top 10 Airlines by Total Number of Delays (2015)', fontsize=16, fontweight='bold')
plt.axis('off')
plt.show()
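Whether a carrier's delays are disproportionate can be checked by comparing its share of delayed flights with its share of all flights. The following is a minimal sketch of that comparison, assuming the same `AIRLINE` and `DEPARTURE_DELAY` columns; the helper `delay_share_vs_flight_share` and the sample rows are hypothetical.

```python
import pandas as pd

def delay_share_vs_flight_share(df: pd.DataFrame) -> pd.DataFrame:
    """Compare each airline's share of delayed flights with its share of all flights."""
    flight_share = df['AIRLINE'].value_counts(normalize=True)
    delayed = df[df['DEPARTURE_DELAY'] > 0]
    delay_share = delayed['AIRLINE'].value_counts(normalize=True)
    out = pd.DataFrame({
        'flight_share': flight_share,
        'delay_share': delay_share,
    }).fillna(0.0)  # airlines with no delayed flights get a delay share of 0
    # ratio > 1: more delays than volume alone would suggest; ratio < 1: fewer
    out['ratio'] = out['delay_share'] / out['flight_share']
    return out.sort_values('ratio', ascending=False)

# Made-up rows, not taken from the 2015 dataset
sample = pd.DataFrame({
    'AIRLINE': ['WN'] * 4 + ['DL'] * 4 + ['NK'] * 2,
    'DEPARTURE_DELAY': [12, 3, -4, 0, 20, -2, -6, 0, 9, 31],
})
result = delay_share_vs_flight_share(sample)
print(result)
```

A ratio close to 1 means a carrier's delays are roughly in line with its size, which is the pattern expected if volume alone explains the treemap.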
This analysis of over five million flights indicates that delays in 2015 followed a predictable combination of timing, geography, and carrier volume. Peak travel seasons in the summer and December, combined with the daily snowball effect in the late afternoon and evening, carried the highest risk for passengers. Geographically, a large portion of delays was concentrated at major hubs such as Atlanta and Chicago, showing how a few bottlenecks can disrupt the entire national network. Breaking these patterns down in Python shows that Southwest and Delta accounted for the largest share of disruptions, largely because of their scale. Ultimately, the 2015 dataset is a clear reminder of how much infrastructure strain already existed a decade before the challenges of 2026.