Dataset Information

This dataset was produced by the National Oceanic and Atmospheric Administration’s (NOAA) Storm Prediction Center. It contains data on tornadoes that took place in the United States between 1950 and 2021.

Columns Used:

  • Tornado Date
  • Starting Location
  • Ending Location
  • Magnitude
  • Injury and Fatality Counts

import os
os.environ['QT_QPA_PLATFORM_PLUGIN_PATH'] = 'C:/ProgramData/Anaconda3/Library/plugins/platforms'

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from matplotlib.ticker import FuncFormatter
import matplotlib.ticker as ticker
import geopandas as gpd
import folium

path = "U:/"
filename = "us_tornado_dataset_1950_2021_2.csv"
df = pd.read_csv(path+filename)
df = df.rename(columns={'yr': 'Year', 'mo': 'Month', 'dy': 'Day', 'st':'State', 'len':'Length', 'wid':"Width"})
df['date'] = pd.to_datetime(df['date'])
df['Month'] = df['date'].dt.month
df['Year'] = df['date'].dt.year
df['decade'] = (df['date'].dt.year //10) * 10
# Treat missing injury/fatality values as zero (assign back instead of using inplace on a column)
df['inj'] = df['inj'].fillna(0)
df['fat'] = df['fat'].fillna(0)
df['Total_Inj_Fat'] = df['inj'] + df['fat']

Findings

  1. Texas has the most tornado occurrences, but Oklahoma, Kansas, and Florida have more tornadoes per square mile.
  2. Since the 1950s, the average length traveled by tornadoes has decreased, while their average width has increased.
  3. The most dangerous tornadoes occur in the spring - specifically April.
  4. Super outbreaks tend to occur in April.

Origin & End

Starting Points by County

# Load GeoData
geoData = gpd.read_file('https://raw.githubusercontent.com/holtzy/The-Python-Graph-Gallery/master/static/data/US-counties.geojson');

# Convert STATE to string
geoData['STATE'] = geoData['STATE'].astype(str);

# Remove states to be excluded
stateToRemove = ['02', '15', '72'];
geoData = geoData[~geoData.STATE.isin(stateToRemove)];

# Process tornado data
tornado_start = df.groupby(['scounty', 'State']).size().reset_index(name='Count');
# Exclude DC, Puerto Rico, Alaska ('AK'), and Hawaii
tornado_start = tornado_start[~tornado_start['State'].isin(['DC', 'PR', 'AK', 'HI'])];
tornado_start = tornado_start[~tornado_start['scounty'].str.contains('xxx')];
tornado_start['scounty'] = tornado_start['scounty'].str.replace(' County', '', regex=True) \
                                                   .str.replace(' Parish', '', regex=True) \
                                                   .str.replace(' Region', '', regex=True) \
                                                   .str.replace('Saint ', 'St. ', regex=True);

# Remove invalid counties
invalid_counties = tornado_start[~tornado_start['scounty'].isin(geoData['NAME'])];
tornado_start = tornado_start[~tornado_start['scounty'].isin(invalid_counties['scounty'])];

# Merge tornado data with GeoData on county name
merged_data_start = geoData.merge(tornado_start, left_on='NAME', right_on='scounty', how='left');

# Create map
center_of_map = [38, -97];
start_map = folium.Map(location=center_of_map,
                       zoom_start=4,
                       tiles='cartodbpositron',
                       width='90%', height='100%',
                       left='5%', right='5%',
                       top='0%');

# Create choropleth map
ch_map_start = folium.Choropleth(geo_data=merged_data_start,
                                  name='choropleth',
                                  data=merged_data_start,
                                  columns=['scounty', 'Count'],
                                  key_on='feature.properties.NAME',
                                  fill_color='YlGnBu',
                                  fill_opacity=0.9,
                                  line_opacity=0.4,
                                  legend_name='Number of Tornadoes that Originated',
                                  highlight=True).add_to(start_map);

# Display tooltip
ch_map_start.geojson.add_child(folium.features.GeoJsonTooltip(fields=['NAME', 'State', 'Count'],
                                                              aliases=['County: ', 'State: ', 'Tornado Count: '],
                                                              labels=True,
                                                              style=('background-color:black; color:white;')));

start_map

Ending Points by County

# Process tornado data
tornado_end = df.groupby(['ecounty', 'State']).size().reset_index(name='Count');
# Exclude DC, Puerto Rico, Alaska ('AK'), and Hawaii
tornado_end = tornado_end[~tornado_end['State'].isin(['DC', 'PR', 'AK', 'HI'])];
tornado_end = tornado_end[~tornado_end['ecounty'].str.contains('xxx')];
tornado_end['ecounty'] = tornado_end['ecounty'].str.replace(' County', '', regex=True) \
                                                   .str.replace(' Parish', '', regex=True) \
                                                   .str.replace(' Region', '', regex=True) \
                                                   .str.replace('Saint ', 'St. ', regex=True);

# Remove invalid counties
invalid_counties = tornado_end[~tornado_end['ecounty'].isin(geoData['NAME'])];
tornado_end = tornado_end[~tornado_end['ecounty'].isin(invalid_counties['ecounty'])];

# Merge tornado data with GeoData on county name
merged_data_end = geoData.merge(tornado_end, left_on='NAME', right_on='ecounty', how='left');

# Create map
center_of_map = [38, -97];
end_map = folium.Map(location = center_of_map,
                     zoom_start = 4,
                     tiles = 'cartodbpositron',
                     width = '90%', height = '100%',
                     left = '5%', right = '5%',
                     top = '0%');

# Create choropleth map
ch_map_end = folium.Choropleth(geo_data = merged_data_end,
                          name = 'choropleth',
                          data = merged_data_end,
                          columns = ['ecounty', 'Count'],
                          key_on = 'feature.properties.NAME',
                          fill_color = 'YlGnBu',
                          fill_opacity = 0.9,
                          line_opacity = 0.4,
                          legend_name = 'Number of Tornadoes that Ended',
                          highlight = True).add_to(end_map);

# Display tooltip
ch_map_end.geojson.add_child(folium.features.GeoJsonTooltip(fields=['NAME', 'State', 'Count'],
                                                            aliases=['County: ', 'State: ', 'Tornado Count: '],
                                                            labels=True,
                                                            style=('background-color:black; color:white;')));

end_map

Explanation of Visualization: These choropleth maps show how many tornadoes originated and ended within each county across the contiguous 48 states. Hovering over a county shows a tooltip with the county name, the state, and the number of tornadoes that started or ended there, depending on which map you are viewing.

Data Preparation: Not every county appears in the tornado dataset, and roughly a dozen county names did not match between the GeoJSON file and the dataset. Those counties have no count and appear black on the map.
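
One way to list the counties that fail to match is a left merge with pandas' indicator column. The sketch below is only an illustration on toy data: the frames geo_names and tornado_counts and their values are hypothetical stand-ins for geoData['NAME'] and the cleaned tornado counts, not the real inputs.

```python
import pandas as pd

# Hypothetical stand-ins for geoData['NAME'] and the per-county tornado counts.
geo_names = pd.DataFrame({'NAME': ['Adams', 'Baker', 'Clark', 'St. Clair']})
tornado_counts = pd.DataFrame({'scounty': ['Adams', 'Clark', 'Saint Clair'],
                               'Count': [12, 7, 3]})

# A left merge keeps every map polygon; indicator=True adds a '_merge' column
# recording whether a matching tornado row was found.
check = geo_names.merge(tornado_counts, left_on='NAME', right_on='scounty',
                        how='left', indicator=True)

# Polygons found only on the left side have no Count and would be drawn without data.
unmatched = check.loc[check['_merge'] == 'left_only', 'NAME'].tolist()
print(unmatched)
```

In this toy case 'Saint Clair' was never normalized to 'St. Clair', so that polygon is reported as unmatched alongside a county with no tornado rows at all.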

Key Observations: The vast majority of counties have had at least one tornado begin and at least one end within them. The Midwest and Southeast have more green and blue (higher-count) counties than other areas of the country.

By State

# Keep only rows with a rated magnitude ('mag' of -9 marks an unknown rating)
df_bar = df[df['mag'] != -9];

# Group by 'State' and 'mag'
state_mag_df = df_bar.groupby(['State', 'mag']).size().unstack(fill_value=0);

# Calculate the total tornado count for each state and sort in descending order
state_mag_df['Total'] = state_mag_df.sum(axis=1);
state_mag_df = state_mag_df.sort_values(by='Total', ascending=False);
state_mag_df = state_mag_df.head(10);

# Define state names
state_names = {'TX':'Texas', 'KS':'Kansas',
               'OK':'Oklahoma', 'FL':'Florida',
               'NE':'Nebraska', 'IA':'Iowa',
               'IL':'Illinois', 'MS':'Mississippi',
               'MO':'Missouri', 'AL':'Alabama',
              };

# Drop the 'Total' column
stacked_df = state_mag_df.drop(columns=['Total']);

# Create plot
fig = plt.figure(figsize=(30,20));
ax = fig.add_subplot(1, 1, 1);

stacked_df.plot(kind='bar', stacked=True, ax=ax)

# Set title and axis labels
plt.title('Top 10 States with Most Tornadoes and Magnitude Distribution\n Stacked Bar Plot', fontsize=50, pad=25);
plt.ylabel('Tornado Count', fontsize=40, labelpad=20);
ax.set_xlabel('State', fontsize=40, labelpad=20);
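# Look up full state names in the plotted (sorted) order instead of relying on dict order
ax.set_xticklabels([state_names.get(abbr, abbr) for abbr in stacked_df.index], rotation=0, ha='center', fontsize=30);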
plt.yticks(fontsize=30);


# Format y-axis tick labels with commas
ax.yaxis.set_major_formatter(FuncFormatter(lambda x, _: '{:,}'.format(int(x))));

# Set y-axis tick locations for multiples of 1,000
ax.yaxis.set_major_locator(ticker.MultipleLocator(1000));

# Custom legend labels
legend_labels = ['0 - Light Damage', '1 - Moderate Damage', '2 - Considerable Damage',
                 '3 - Severe Damage', '4 - Devastating Damage', '5 - Incredible Damage'];
legend = plt.legend(legend_labels, title='Magnitude', fontsize=30);
plt.setp(legend.get_title(), fontsize=30);

# Add labels to the end of each bar for the total tornado count
for i, total in enumerate(state_mag_df['Total']):
    ax.text(i, total + total * 0.01, f'{total:,}', ha='center', va='bottom', fontsize=25, fontweight='bold');

plt.show()

Explanation of Visualization: This stacked bar chart illustrates the tornado activity of the top 10 states from 1950 to 2021. Each state’s bar is segmented by tornado magnitude (0 to 5), with one color per rating on the Enhanced Fujita Scale.

Data Preparation: This graph only includes tornadoes that had a rating specified in the dataframe since specific wind speeds were not given.
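
To make the effect of that filter concrete, here is a minimal sketch on a toy frame. The rows are hypothetical; only the column names State and mag and the -9 marker follow the code above.

```python
import pandas as pd

# Hypothetical rows; a 'mag' of -9 marks a tornado without a rating.
toy = pd.DataFrame({'State': ['TX', 'TX', 'TX', 'TX', 'KS', 'KS', 'OK'],
                    'mag':   [0, 1, 1, -9, 0, 0, 2]})

# Drop unrated tornadoes, then count tornadoes per state and magnitude.
rated = toy[toy['mag'] != -9]
by_state = rated.groupby(['State', 'mag']).size().unstack(fill_value=0)

# Row totals give the height of each stacked bar.
by_state['Total'] = by_state.sum(axis=1)
by_state = by_state.sort_values(by='Total', ascending=False)
print(by_state)
```

The unrated Texas row is excluded, so the Texas total here is 3 rather than 4; totals in the chart are therefore counts of rated tornadoes only.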

Key Observations: During this period, Texas experienced a notably higher number of tornadoes compared to any other state. Tornadoes classified as Magnitude 0 - resulting in Light Damage - were the most common, with a decline in frequency observed as the magnitude increased.

By Square Mile

# Convert 'date' column to datetime format
df['date'] = pd.to_datetime(df['date']);

# Define the top 5 states with the most tornadoes and map their square mileage
top_5_states = ['TX', 'KS', 'OK', 'FL', 'NE'];
state_square_miles = {'TX':268596,
                      'KS':82278,
                      'OK':69899,
                      'FL':65758,
                      'NE':77348};

# Create a new column 'decade'
df['decade'] = (df['date'].dt.year //10) * 10;


# Create dataframe for the bump chart
bump_df_filtered = df[df['State'].isin(top_5_states)];
bump_df = bump_df_filtered.groupby(['State', 'decade']).size().reset_index(name='Tornado_Count');
bump_df['Freq_Per_Area'] = bump_df.apply(lambda x: x['Tornado_Count'] / state_square_miles[x['State']], axis = 1);

bump_df = bump_df.pivot(index='decade', columns='State', values='Freq_Per_Area');
bump_df_ranked = bump_df.rank(axis=1, ascending=False, method='min');

# Plotting bump chart
fig = plt.figure(figsize=(30,18));
ax = fig.add_subplot(1,1,1);

bump_df_ranked.plot(kind='line', ax=ax, marker='o', markeredgewidth=4,
                    linewidth=10, markersize=18, markerfacecolor='white');

# Add axes labels and title
plt.ylabel('Decade Ranking', fontsize=40, labelpad=20);
ax.set_xlabel('Decade', fontsize=40, labelpad=20);
plt.title('Ranking by Decade of Frequency of Tornadoes Per Square Mile \n Bump Chart', fontsize=50, pad=25);
plt.xticks(fontsize=30);
plt.yticks(fontsize=30);

# Flip y axis
ax.invert_yaxis();
ax.yaxis.set_major_locator(plt.MaxNLocator(integer=True));

# Reorder labels in legend and use full state names
handles, labels = ax.get_legend_handles_labels();
handles = [handles[3], handles[1], handles[2], handles[4], handles[0]];
labels = [state_names[label] for label in [labels[3], labels[1], labels[2], labels[4], labels[0]]];

# Move legend
ax.legend(handles, labels, bbox_to_anchor=(1, 1), fontsize=30,
         labelspacing=1, borderpad=1);

# After plotting, save the figure as an image
plt.savefig('bump_chart.png', bbox_inches='tight');

Bump Chart

Explanation of Visualization: This bump chart shows the ranking of the top 5 states with the most tornadoes in terms of frequency per square mile.

The previous bar chart revealed Texas had a markedly higher number of recorded tornadoes than any other state. However, does this suggest a higher likelihood of encountering a tornado in Texas, or was this due to Texas’ significantly larger size compared to the next four states with the highest tornado counts?

Data Preparation: After determining the top 5 states with the most recorded tornadoes, I mapped each state's square mileage to it, computed tornadoes per square mile for each decade, and ranked the states for the bump chart.
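
The normalization can be sketched for a single decade as follows. The two land areas are copied from the state_square_miles mapping above; the tornado counts are hypothetical and chosen only to show how the ranking can flip once area is taken into account.

```python
import pandas as pd

# Land areas in square miles, taken from the state_square_miles mapping.
area = pd.Series({'TX': 268596, 'OK': 69899})

# Hypothetical tornado counts for one decade (not values from the dataset).
counts = pd.Series({'TX': 1500, 'OK': 600})

# Tornadoes per square mile, then rank with 1 = highest frequency.
per_sq_mile = counts / area
rank = per_sq_mile.rank(ascending=False, method='min').astype(int)
print(rank)
```

With these illustrative numbers Texas has more tornadoes in total, yet Oklahoma ranks first per square mile, which is the kind of reversal the bump chart is meant to expose.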

Key Observations: Over the decades spanning from the 1950s to the 2020s, Texas consistently ranked lower in tornado frequency per area compared to Oklahoma, Florida, and Kansas.

By Size

# Create a pivot table for average width by month and decade
avg_wid = df.pivot_table(index='Month', columns='decade', values='Width', aggfunc='mean')
avg_wid.columns = avg_wid.columns.astype(str)

# Create plot
fig = plt.figure(figsize=(30,18));
ax = fig.add_subplot(1, 1, 1);

# Define colors for each decade
my_colors = {'1950':'red',
            '1960':'darkorange',
            '1970':'gold',
            '1980':'green',
            '1990':'blue',
            '2000':'purple',
            '2010':'magenta',
            '2020':'brown'};

# Loop through decades to plot average width
for decade in avg_wid.columns:
    plt.plot(avg_wid.index, avg_wid[decade], label=decade + 's', color=my_colors[decade],
             marker='8', markersize=12, linewidth=5);

# Set title, axis labels, and legend
plt.title('Average Width of Tornadoes Per Month Throughout the Decades', fontsize=50, pad=25);
plt.xlabel('Month', fontsize=40, labelpad=20);
plt.ylabel('Average Width', fontsize=40, labelpad=20);
plt.legend(title='Decade', fontsize=30, title_fontsize=30, bbox_to_anchor=(1, 0.9));

# add gridlines
plt.grid(True);

# Format x-axis tick labels with month abbreviations
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'];
plt.xticks(range(1, 13), months, fontsize=30);
plt.yticks(fontsize=30);

# Format y-axis tick labels with 'yds'
ax.yaxis.set_major_formatter(FuncFormatter(lambda x, pos: f'{x:.0f} yds'));

plt.show()

# Create a pivot table for average length by month and decade
avg_len = df.pivot_table(index='Month', columns='decade', values='Length', aggfunc='mean')
avg_len.columns = avg_len.columns.astype(str)

# Create plot
fig = plt.figure(figsize=(30,18));
ax = fig.add_subplot(1, 1, 1);

# Loop through decades to plot average length
for decade in avg_len.columns:
    plt.plot(avg_len.index, avg_len[decade], label=decade + 's', color=my_colors[decade],
             marker='8', markersize=12, linewidth=5);

# Set title, axis labels, and legend
plt.title('Average Length Traveled Per Month Throughout the Decades', fontsize=50, pad=25);
plt.xlabel('Month', fontsize=40, labelpad=20);
plt.ylabel('Average Length', fontsize=40, labelpad=20);
plt.legend(title='Decade', fontsize=30, title_fontsize=30, bbox_to_anchor=(1, 0.9));

# Add grid lines
plt.grid(True);

# Format x-axis tick labels with month abbreviations
plt.xticks(range(1, 13), months, fontsize=30);
plt.yticks(fontsize=30);
ax.set_yticks(np.arange(11))

# Format y-axis tick labels with 'mi'
ax.yaxis.set_major_formatter(FuncFormatter(lambda x, pos: f'{x:.0f} mi'));

plt.show()

Explanation of Visualizations: These multi-line plots show how the average width and distance traveled of tornadoes vary by month across different decades.

Key Observations: Based on these plots, winter months consistently exhibit a higher average width and distance covered by tornadoes compared to warmer months.

There are also distinct trends across decades: the 1950s saw a greater average length traveled than the 2000s and later. Average tornado width, on the other hand, has increased since the 1950s.
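
The decade-level comparison can also be checked numerically rather than read off the plots. This is a rough sketch on a hypothetical toy frame; the column names mirror the renamed Length and Width columns, and the values are made up for illustration.

```python
import pandas as pd

# Hypothetical tornado records: length in miles, width in yards.
toy = pd.DataFrame({
    'Year':   [1953, 1957, 2004, 2011, 2015],
    'Length': [8.0, 6.0, 3.0, 2.0, 4.0],
    'Width':  [50, 70, 120, 200, 160],
})

# Bin each year into its decade the same way the notebook does.
toy['decade'] = (toy['Year'] // 10) * 10

# One row per decade with the mean length and width.
decade_means = toy.groupby('decade')[['Length', 'Width']].mean()
print(decade_means)
```

Running the same groupby on the full dataframe would give one mean per decade for each measure, which makes the direction of each trend easier to state precisely.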

By Casualties

# Create dataframe grouped by decade and month
scatter_df = df.groupby(['decade', 'Month'])['Total_Inj_Fat'].sum().reset_index();
scatter_df = pd.DataFrame(scatter_df);
scatter_df = scatter_df[scatter_df['decade'] != 2020]

# Create scatterplot
plt.figure(figsize=(18,10));
plt.scatter(scatter_df['Month'], scatter_df['decade'], marker='8', cmap='bwr', 
            c=scatter_df['Total_Inj_Fat'], s=scatter_df['Total_Inj_Fat']*.45, edgecolors='darkblue');
plt.title('Number of Casualties by Month and Decade', fontsize=25);
plt.xlabel('Months of the Year', fontsize=20, labelpad=20);
plt.ylabel('Decade', fontsize=20, labelpad=20);

# create colorbar 
cbar = plt.colorbar();
cbar.set_label('Number of Casualties', rotation=270, fontsize=14, color='black', labelpad=30);

# Show all months on x axis
my_x_ticks = [*range(scatter_df['Month'].min(), scatter_df['Month'].max()+1, 1)];
plt.xticks(my_x_ticks, fontsize=16);

plt.yticks(fontsize=16);

plt.show()

Explanation of Visualizations: This scatterplot shows the total number of casualties (injuries plus fatalities) for each month in each decade. Both the marker size and the colorbar shading scale with that total.

Data Preparation: The 2020 decade was removed since the data only included numbers from 2020 and 2021.
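
A partial decade can be detected programmatically instead of being hard-coded. The sketch below assumes a plain series of years (hypothetical here, covering 2000 through 2021) and flags any decade with fewer than ten distinct years.

```python
import pandas as pd

# Hypothetical coverage: one entry per year from 2000 through 2021.
years = pd.Series(list(range(2000, 2022)), name='Year')

# Same decade binning as in the notebook.
decade = (years // 10) * 10

# Count distinct years per decade and keep the decades that are incomplete.
years_per_decade = years.groupby(decade).nunique()
partial = years_per_decade[years_per_decade < 10].index.tolist()
print(partial)
```

Summed totals for an incomplete decade are not comparable with full decades, which is why it is dropped before plotting.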

Key Observations: April was the month with the most casualties, especially in the 1960s, 1970s, and 2010s.

Super Outbreaks

Super Tornado Outbreaks

1965
1974
2011
# Filter tornado data for the specified dates
super_outbreak_1965 = ['1965-04-11'];
super_outbreak_1974 = ['1974-04-03', '1974-04-04'];
super_outbreak_2011 = ['2011-04-25', '2011-04-26', '2011-04-27', '2011-04-28'];
super_outbreak_df_1965 = df[df['date'].isin(super_outbreak_1965)];
super_outbreak_df_1974 = df[df['date'].isin(super_outbreak_1974)];
super_outbreak_df_2011 = df[df['date'].isin(super_outbreak_2011)];

# Create the map
center_of_map = [38, -97];
my_map = folium.Map(location = center_of_map,
                   zoom_start = 4,
                   width = '90%', height = '100%',
                   left = '5%', right = '5%',
                   top = '0%');

# Add markers for 1965 tornadoes
for index, row in super_outbreak_df_1965.iterrows():
    radius = row['mag'] + 1
    tooltip = f"Date: {row['date'].strftime('%m/%d/%Y')}<br> Magnitude: {row['mag']}"
    folium.CircleMarker(location=[row['slat'], row['slon']], 
                        radius=radius, color='green', fill=True, fill_color='green', fill_opacity=0.7,
                        popup=tooltip).add_to(my_map);

# Add markers for 1974 tornadoes
for index, row in super_outbreak_df_1974.iterrows():
    radius = row['mag'] + 1
    tooltip = f"Date: {row['date'].strftime('%m/%d/%Y')}<br> Magnitude: {row['mag']}"
    folium.CircleMarker(location=[row['slat'], row['slon']], 
                        radius=radius, color='blue', fill=True, fill_color='blue', fill_opacity=0.7,
                        popup=tooltip).add_to(my_map);

# Add markers for 2011 tornadoes
for index, row in super_outbreak_df_2011.iterrows():
    radius = row['mag'] + 1
    tooltip = f"Date: {row['date'].strftime('%m/%d/%Y')}<br> Magnitude: {row['mag']}"
    folium.CircleMarker(location=[row['slat'], row['slon']], 
                        radius=radius, color='red', fill=True, fill_color='red', fill_opacity=0.7,
                        popup=tooltip).add_to(my_map);

my_map
# Group data for 1965
pie_1965 = super_outbreak_df_1965.groupby(['mag'])['mag'].count().reset_index(name='Count');
pie_1965.sort_values(by=['mag'], inplace=True);
total_count_1965 = pie_1965['Count'].sum();

# Group data for 1974
pie_1974 = super_outbreak_df_1974.groupby(['mag'])['mag'].count().reset_index(name='Count');
pie_1974.sort_values(by=['mag'], inplace=True);
total_count_1974 = pie_1974['Count'].sum();

# Group data for 2011
pie_2011 = super_outbreak_df_2011.groupby(['mag'])['mag'].count().reset_index(name='Count');
pie_2011.sort_values(by=['mag'], inplace=True);
total_count_2011 = pie_2011['Count'].sum();

# Function to calculate wedge counts
def count_wedges(wedge_sizes):
    total = sum(wedge_sizes)
    # Round rather than truncate so floating-point error cannot drop a count by one
    return lambda pct: int(round(pct * total / 100));

# Dictionary to map magnitudes to colors
colors = {
    0: '#1f77b4',  # Blue
    1: '#ff7f0e',  # Orange
    2: '#2ca02c',  # Green
    3: '#d62728',  # Red
    4: '#9467bd',  # Purple
    5: '#8c564b',  # Brown
};

# Create charts
fig, axs = plt.subplots(1, 3, figsize=(12, 6));

# Plot for 1965
axs[0].pie(pie_1965['Count'], labels=['M' + str(mag) for mag in pie_1965['mag']], startangle=90,
           autopct=count_wedges(pie_1965['Count']), wedgeprops=dict(width=.60), pctdistance=0.85,
           colors=[colors.get(mag) for mag in pie_1965['mag']]);
axs[0].set_title('1965 Super Outbreak by Magnitude\nApril 11', size=15);
axs[0].text(0, 0, 'Total Count:\n' + str(total_count_1965), size=12, ha='center', va='center');

# Plot for 1974
axs[1].pie(pie_1974['Count'], labels=['M' + str(mag) for mag in pie_1974['mag']], startangle=90,
           autopct=count_wedges(pie_1974['Count']), wedgeprops=dict(width=.60), pctdistance=0.85,
           colors=[colors.get(mag) for mag in pie_1974['mag']]);
axs[1].set_title('1974 Super Outbreak by Magnitude\nApril 3-4', size=15);
axs[1].text(0, 0, 'Total Count:\n' + str(total_count_1974), size=12, ha='center', va='center');

# Plot for 2011
axs[2].pie(pie_2011['Count'], labels=['M' + str(mag) for mag in pie_2011['mag']], startangle=90,
           autopct=count_wedges(pie_2011['Count']), wedgeprops=dict(width=0.60), pctdistance=0.85,
           colors=[colors.get(mag) for mag in pie_2011['mag']]);
axs[2].set_title('2011 Super Outbreak by Magnitude\nApril 25-28', size=15);
axs[2].text(0, 0, 'Total Count:\n' + str(total_count_2011), size=12, ha='center', va='center');

plt.tight_layout();
plt.show()

Explanation of Visualizations: After trying out a few different heatmaps, I noticed that April of 1965, 1974, and 2011 had highly elevated tornado counts and casualties. After additional research, I learned there were ‘Super Outbreaks’ in April of each of those years and decided to dig in further. The map plots each outbreak's tornado starting points (green for 1965, blue for 1974, red for 2011), and the donut charts break each outbreak down by magnitude.

Key Observations: Each successive Super Outbreak produced more tornadoes than the one before it. Fortunately, the magnitudes have not increased.
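
The wedge labels in the donut charts are counts recovered from the percentages that matplotlib passes to an autopct callback. A minimal, matplotlib-free sketch of that round trip is below; the wedge sizes are hypothetical, and rounding is used because the recovered value can land slightly off an integer in floating point.

```python
def count_wedges(wedge_sizes):
    """Return a callback that converts a wedge percentage back into a count."""
    total = sum(wedge_sizes)
    # Round to the nearest integer; truncating could undercount by one
    # when the float result falls just below the true count.
    return lambda pct: int(round(pct * total / 100))

# Hypothetical tornado counts per magnitude.
sizes = [3, 7, 19]
total = sum(sizes)
to_count = count_wedges(sizes)

# Simulate the percentages a pie chart would hand to the callback.
recovered = [to_count(100 * s / total) for s in sizes]
print(recovered)
```

Because each label is rebuilt from a percentage, checking that the labels sum to the displayed total is a cheap sanity test.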

Fujita Scale

Enhanced Fujita Scale

Scale   Wind Speed (mph)   Damage
EF0     65-85              Light damage
EF1     86-110             Moderate damage
EF2     111-135            Considerable damage
EF3     136-165            Severe damage
EF4     166-200            Devastating damage
EF5     200+               Incredible damage
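
The table can be turned into a small lookup. This is a sketch under two assumptions that are mine rather than the table's: a wind speed of exactly 200 mph is treated as EF4 (following the 166-200 row), and speeds falling between two listed ranges are assigned to the lower class. The function name ef_rating is hypothetical.

```python
from bisect import bisect_right

# Lower bound (mph) of each class EF0..EF5; EF5 is assumed to start above 200.
EF_LOWER_BOUNDS = [65, 86, 111, 136, 166, 201]

def ef_rating(wind_mph):
    """Return the EF class for a wind speed in mph, or None below the EF0 range."""
    if wind_mph < EF_LOWER_BOUNDS[0]:
        return None
    # Count how many lower bounds the speed has reached, then shift to a 0-based class.
    return 'EF' + str(bisect_right(EF_LOWER_BOUNDS, wind_mph) - 1)

for speed in (60, 65, 110, 150, 200, 250):
    print(speed, ef_rating(speed))
```

Note that the dataset stores only the rating in mag, not a wind speed, so a lookup like this is useful for reading the scale rather than for processing the data.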

Wrap up

Summary

Several patterns emerged across the visualizations:

  1. Tornado Hotspots: The choropleth maps of tornado starting and ending points highlighted higher concentrations in the Midwest and Southeast, but most counties in the contiguous US have at least one recorded tornado in the roughly 70 years covered.

  2. Tornado Alley: Tornado Alley is generally considered to run from Texas through Oklahoma, Kansas, Nebraska, South Dakota, Iowa, Minnesota, Wisconsin, Illinois, Indiana, Missouri, Arkansas, North Dakota, Montana, Ohio, and eastern portions of Colorado and Wyoming. This data shows that many tornadoes also occur in the Southeast.

  3. Size Over the Years: Since 1950 there has been a decrease in the average length traveled by tornadoes. Whether this reflects improvements in tornado detection or actual changes in tornado behavior is unclear.

Additional Research

Here are some additional questions that came up while working with the data. Some I could not get to because of time constraints or data quality; others could not be answered with the data available.

  • What are the weather factors occurring that determine the size or magnitude of a tornado?

  • Has the amount of damage changed over time due to stronger infrastructure being developed specifically for tornado areas?

  • How has climate change affected the occurrence or location of tornadoes?