Data Visualization


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from matplotlib.ticker import FuncFormatter
import plotly.graph_objects as go
import folium
import warnings
warnings.filterwarnings('ignore')

path = "C:/Users/jrodri57/Downloads/US_Accidents_March23.csv"

df = pd.read_csv(path, usecols=['State','Severity','Weather_Condition','Start_Time','Start_Lat','Start_Lng'])
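# Assign the result back; calling fillna(inplace=True) on a column accessor is deprecated in recent pandas
df['Weather_Condition'] = df['Weather_Condition'].fillna("Not Available")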

df['Start_Time'] = df['Start_Time'].str.split('.').str[0]

df['Start_Time'] = pd.to_datetime(df['Start_Time'], format='%Y-%m-%d %H:%M:%S')

df['Month'] = df['Start_Time'].dt.month
df['MonthName'] = df['Start_Time'].dt.month_name()
df['Year'] = df['Start_Time'].dt.year
df['Day']  = df['Start_Time'].dt.day
df['Hour'] = df['Start_Time'].dt.hour
df['WeekDay'] = df['Start_Time'].dt.strftime('%a')

severity_mapping = {
    1: 'Low',
    2: 'Moderate',
    3: 'High',
    4: 'Critical'
}

df['Severity_Level'] = df['Severity'].map(severity_mapping)

filtered_df = df[df['Severity_Level'].isin(['High', 'Critical'])].copy()

Introduction

Let’s look at traffic accident patterns in Maryland using information from several graphs covering 2016 to 2023.

This overview will cover:

How serious accidents typically are in the state.

Where they happen most, focusing on Maryland roads and areas like those around Ellicott City.

When they are most common, looking at time of day, day of the week, and different months.

We’ll also touch on related factors like weather and how Maryland fits into the national picture. This summary helps to understand the main risks associated with driving in Maryland based on recent accident data.

Dataset

This is a countrywide car accident dataset that covers 49 states of the USA. The accident data were collected from February 2016 to March 2023, using multiple APIs that provide streaming traffic incident (or event) data. These APIs broadcast traffic data captured by various entities, including the US and state departments of transportation, law enforcement agencies, traffic cameras, and traffic sensors within the road networks.

The data cover the contiguous United States and were collected in real time. More details are available at the dataset source listed below.

The dataset currently contains approximately 7.7 million accident records.

Dataset file size: 2.84 GB

Dataset Name: A Countrywide Traffic Accident Dataset (2016 - 2023)

Dataset Source: https://www.kaggle.com/datasets/sobhanmoosavi/us-accidents

Findings


Accidents By State

This bar chart displays the Top 20 US States ranked by the number of severe traffic accidents (defined as ‘High and Critical’ severity levels) that occurred between 2016 and 2020.

Highest Counts: California (CA) has significantly more severe accidents than any other state, with 285,316 incidents. Texas (TX) and Florida (FL) follow, with 127,652 and 117,214 accidents, respectively.

Average: The mean number of severe accidents per state is 30,094 (indicated by the dashed line). The code computes this mean across every state in the dataset, not only the 20 states shown, which is why most of the plotted bars sit above the line.

Maryland, highlighted in red, ranks 14th among these 20 states with 33,810 severe accidents, which is slightly above that mean.
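The difference between a mean taken over all states and a mean taken over only the plotted states can be sketched with a small example. The toy records and the names `toy`, `counts`, and `top_n` below are hypothetical stand-ins, not values from the dataset.

```python
import pandas as pd

# Hypothetical toy records standing in for the filtered severe-accident rows
toy = pd.DataFrame({'State': ['CA'] * 4 + ['TX'] * 3 + ['MD'] * 2 + ['OH']})

# Count accidents per state, sorted from most to fewest
counts = toy['State'].value_counts()

# Mean over every state versus mean over only the top-N states shown
top_n = 2
mean_all = counts.mean()
mean_top = counts.head(top_n).mean()

# 1-based rank of Maryland in the sorted counts
md_rank = counts.index.get_loc('MD') + 1

print(mean_all, mean_top, md_rank)
```

Because the top-N subset drops the smallest states, its mean is always at least as large as the mean over all states.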

x2 = filtered_df.groupby(['State']).agg({'State':['count']})
x2.columns = ['Count']
x2 = x2.sort_values('Count',ascending=False).reset_index()

fig2 = plt.figure(figsize=(15, 10))
ax2 = fig2.add_subplot(1, 1, 1)
ax2.spines['right'].set_visible(False)
ax2.spines['top'].set_visible(False)
ax2.yaxis.set_major_formatter(ticker.StrMethodFormatter('{x:,.0f}'))

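# Keep exactly the 20 highest-count states (.loc[0:20] is end-inclusive and would keep 21)
top20 = x2.head(20)
# Highlight Maryland in red; all other bars use the default color
bar_colors = ['red' if state == 'MD' else 'C0' for state in top20['State']]

bars = plt.bar(top20['State'], top20['Count'], color=bar_colors, label='State Count')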

plt.xlabel('States')
plt.ylabel('Count of Accidents')
plt.title('Top 20 States - Accident Severity Level (High and Critical)\nUS Accidents (2016 - 2020)')

for bar in bars:
    yval = bar.get_height()
    ax2.text(bar.get_x() + bar.get_width()/2, yval, f'{yval:,}', ha='center', va='bottom')

plt.xticks(rotation=0, ha='center')

mean = x2['Count'].mean()

ax2.axhline(mean, color='black', linestyle='dashed')
ax2.text('OH', mean - 3500, 'Mean = ' + str(int(mean)), rotation=0, size=10, va='center', ha='right')

plt.show()

Accident Weather Conditions

This horizontal bar chart illustrates the top 15 weather conditions associated with severe traffic accidents (classified as ‘High’ and ‘Critical’ severity levels) in the United States. The data covers the period from 2016 to 2023.

Dominant Conditions: Contrary to what some might expect, the vast majority of severe accidents occur during seemingly non-hazardous weather. ‘Fair’ (approx. 300,000) and ‘Clear’ (approx. 271,000) conditions account for the highest number of severe accidents by a significant margin.

Cloudy Conditions: Various forms of cloudy weather (‘Mostly Cloudy’, ‘Partly Cloudy’, ‘Overcast’, ‘Cloudy’) also represent substantial portions of the severe accident counts, collectively contributing to hundreds of thousands of incidents.

Precipitation: Conditions involving precipitation generally rank lower. ‘Light Rain’ accounts for about 80,000 severe accidents, while ‘Light Snow’, ‘Rain’, ‘Heavy Rain’, and ‘Light Drizzle’ have progressively fewer counts within this top 15 list.
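One way to check the "collectively" claim about cloudy conditions is to fold the individual labels into broader groups before summing. This is a minimal sketch: the counts are hypothetical, and the helper `group_of` and its group names are assumptions introduced only for illustration.

```python
import pandas as pd

# Hypothetical counts per weather label; the real values come from the dataset
toy_counts = pd.Series({
    'Fair': 30, 'Clear': 27, 'Mostly Cloudy': 20, 'Overcast': 15,
    'Partly Cloudy': 12, 'Cloudy': 9, 'Light Rain': 8, 'Light Snow': 3,
})

cloudy_labels = ['Mostly Cloudy', 'Partly Cloudy', 'Overcast', 'Cloudy']

def group_of(label):
    """Map a weather label to a broader group (hypothetical grouping)."""
    if label in cloudy_labels:
        return 'Cloudy (all forms)'
    if label in ('Fair', 'Clear'):
        return 'Fair/Clear'
    return 'Precipitation'

# Passing a function to groupby applies it to each index label
grouped = toy_counts.groupby(group_of).sum().sort_values(ascending=False)
print(grouped)
```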

x1 = filtered_df.groupby(['Weather_Condition']).agg({'Weather_Condition':['count']}).reset_index()
x1.columns = ['Weather_Condition','Count']
x1 = x1.sort_values('Count',ascending=False).reset_index(drop=True)

fig = plt.figure(figsize=(15, 10))
ax1 = fig.add_subplot(1, 1, 1)
ax1.spines['right'].set_visible(False)
ax1.spines['top'].set_visible(False)
ax1.xaxis.set_major_formatter(ticker.FuncFormatter(lambda x, pos: f'{int(x/1000)}k' if x >= 1000 else f'{int(x)}'))

ax1.xaxis.set_major_locator(ticker.MultipleLocator(10000))

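# Keep exactly the 15 most frequent conditions (.loc[0:15] is end-inclusive and would keep 16),
# reversed so the largest bar is drawn at the top
top15 = x1.head(15).iloc[::-1]
bars = plt.barh(top15['Weather_Condition'], top15['Count'], label='Weather Condition Count')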

plt.ylabel('Weather Condition')
plt.xlabel('Count of Accidents')
plt.title('Top 15 Accident Weather Conditions - Severity Level (High and Critical)\nUS Accidents (2016 - 2023)')

for bar in bars:
    xval = bar.get_width()
    plt.text(xval, bar.get_y() + bar.get_height()/2, f'{xval:,}', ha='left', va='center')

plt.xticks(rotation=45, ha='center', fontsize=10)

plt.show()

MD Accidents Map

This map pinpoints the locations of severe traffic accidents within and around Maryland that occurred between 2016 and 2023.

Data Representation: Each dot on the map signifies the location of a reported accident.

Severity Coding: The color of the dot indicates the severity level of the accident:

Red dots = ‘Critical’ severity accidents

Orange dots = ‘High’ severity accidents

Purpose: The map visually represents the geographic distribution and concentration hot spots for these serious accidents across the state and its immediate vicinity during the specified timeframe.
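The color coding described above amounts to a small lookup from severity label to marker color. The helper name `severity_color` is hypothetical; unlike a plain if/else, this sketch returns `None` for any label that is not on the map instead of defaulting to orange.

```python
def severity_color(level):
    """Return the marker color for a severity label (hypothetical helper)."""
    colors = {'Critical': 'red', 'High': 'orange'}
    # Labels outside the two severe levels have no marker color
    return colors.get(level)

for level in ['Critical', 'High', 'Moderate']:
    print(level, severity_color(level))
```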

The map reveals distinct patterns in the distribution of high and critical severity accidents in Maryland from 2016 to 2023:

Concentration along Highways: Accidents are heavily concentrated along major transportation corridors, particularly interstate highways like I-95, I-70, I-695 (Baltimore Beltway), I-495 (Capital Beltway), I-270, and US-50/301.

Urban/Suburban Density: The highest density of both critical (red) and high (orange) severity accidents occurs in the densely populated Baltimore-Washington metropolitan area. This includes significant clusters around Baltimore City, Washington D.C., and the surrounding suburban counties.

filtered_df_MD = filtered_df[filtered_df['State'].isin(['MD'])].copy().reset_index(drop=True)
df_MD = df[df['State'].isin(['MD'])].copy().reset_index(drop=True)
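# Map center. Note: these coordinates fall in northern Virginia, west of Washington, D.C.;
# Baltimore Penn Station (Amtrak) is roughly at [39.307, -76.616]
center_of_map = [38.842393, -77.390414]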

my_map = folium.Map(location = center_of_map, zoom_start = 9, width = '90%',
                    height = '100%', left = '5%', right = '5%', top = '0%')

tiles = [('Cartodb Positron'), ('OpenTopoMap'),('Cartodb dark_matter'),
        ('CyclOSM'),('OpenStreetMap.Mapnik')]

for tile_name in tiles:
    folium.TileLayer(tile_name).add_to(my_map)

folium.LayerControl().add_to(my_map)

title_html = '<h3 align="center" style="font-size:20px">Maryland Accidents Locations (2016 - 2023)<br> Severity Level <span style="color:red;"> Critical = Red</span> , <span style="color:orange;">High = Orange</span></h3>'
my_map.get_root().html.add_child(folium.Element(title_html))

# Draw one circle per severe accident: red marks Critical, orange marks High.
# Rows without coordinates are skipped explicitly instead of being hidden by a bare except.
for row in filtered_df_MD.dropna(subset=['Start_Lat', 'Start_Lng']).itertuples(index=False):
    color = 'red' if row.Severity_Level == 'Critical' else 'orange'
    folium.Circle(location=[row.Start_Lat, row.Start_Lng],
                  radius=50,
                  color=color,
                  fill=True,
                  fill_color=color,
                  fill_opacity=0.5).add_to(my_map)

  
#my_map.save('C:/Users/jrodri57/Downloads/US_Accidents.html')

my_map


MD Accident Severity

This donut chart presents the distribution of reported traffic accidents in Maryland based on their assigned severity level. The data covers all recorded incidents from 2016 to 2023, totaling approximately 140,420 accidents.

Most Common: The overwhelming majority of accidents were classified as Moderate severity, accounting for 73.8% of all incidents (approximately 103,557 accidents).

Significant Minority: High severity accidents represent the next largest group, making up 18.1% of the total (approximately 25,450 accidents).

In essence, while severe accidents (High and Critical) represent a significant number of incidents, they are substantially outnumbered by accidents classified as Moderate in Maryland during this period.
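The percentages on the donut chart are each level's share of the total count. A minimal sketch of that arithmetic follows, using hypothetical counts chosen only to keep the numbers easy to follow.

```python
import pandas as pd

# Hypothetical counts per severity level (not the Maryland figures)
counts = pd.Series({'Low': 10, 'Moderate': 740, 'High': 180, 'Critical': 70})

# Share of each level as a percentage of all accidents, to one decimal place
shares = (counts / counts.sum() * 100).round(1)

# Combined share of the two severe levels
severe_share = shares[['High', 'Critical']].sum()

print(shares)
print(severe_share)
```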

pie_df = df_MD.groupby(['Severity_Level']).agg({'Severity_Level':['count']}).reset_index()
pie_df.columns = ['Severity_Level','count']
outside_color_ref_number = np.arange(4) * 4

fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(1, 1, 1)

colormap = plt.get_cmap('tab20c')  # plt.cm.get_cmap was removed in newer matplotlib releases
outer_colors = colormap(outside_color_ref_number)

total_count = pie_df['count'].sum()

def autopct_format(pct):
    # Round before converting so float error does not truncate the count by one
    absolute = int(round(pct / 100. * total_count))
    return "{:.1f}%\n({:,d})".format(pct, absolute)

pie_df.plot(kind= 'pie', radius = 1, colors = outer_colors, pctdistance = 0.75, y='count', labels=pie_df['Severity_Level'], ax=ax,
            labeldistance = 1.05, wedgeprops = dict(edgecolor = 'white'), textprops = {'fontsize':12},
            autopct=autopct_format, startangle = 90)

hole = plt.Circle((0,0), 0.3, fc = 'white')
fig1 = plt.gcf()
fig1.gca().add_artist(hole)

plt.title('Maryland Accident Severity Level Distribution\n(2016 - 2023)', fontsize = 16)
plt.ylabel('')

ax.text(0, 0, 'Total Accidents\n' + f'{total_count/1000:.2f}k', ha = 'center', va = 'center', fontsize = 14)

ax.axis('equal')

plt.tight_layout()

plt.show()

MD Accidents by Hour

This line chart illustrates the total number of traffic accidents reported in Maryland, broken down by the hour of the day and the specific day of the week. The data covers the period from 2016 to 2023.

Weekday Pattern (Monday-Friday):

Accident counts are lowest in the early morning hours (approx. 2-4 AM).

There is a distinct morning peak, generally occurring around 7-8 AM, corresponding to the morning commute.

A larger, more pronounced peak occurs during the afternoon/evening commute, typically between 3 PM (15:00) and 6 PM (18:00), with the absolute highest numbers often seen around 4 PM (16:00) or 5 PM (17:00).

Friday (purple line) tends to have the highest overall afternoon peak and maintain high accident counts later into the evening compared to other weekdays.

Weekend Pattern (Saturday-Sunday):

The distinct morning commute peak seen on weekdays is largely absent.

Accident counts start low and gradually increase throughout the day.

The peak times on weekends are generally broader and occur later in the afternoon and evening compared to weekday peaks. Saturday (yellow line) often shows higher counts than Sunday (brown line), particularly in the afternoon and evening.

Weekend nights, especially late Saturday/early Sunday, show relatively higher accident frequencies compared to weekday nights after the evening commute peak drops off.

Overall Lowest Point: Across all days, the fewest accidents consistently occur in the very early morning hours, typically around 3-4 AM.
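The peak hours described above can also be read off a table of counts by hour and weekday. The sketch below uses a handful of hypothetical timestamps in place of `Start_Time`, and the names `toy`, `table`, and `peak_hour` are illustrative.

```python
import pandas as pd

# Hypothetical timestamps standing in for Start_Time
times = pd.to_datetime([
    '2022-03-07 08:10:00', '2022-03-07 17:05:00', '2022-03-07 17:40:00',  # Monday
    '2022-03-11 16:20:00', '2022-03-11 16:55:00', '2022-03-11 21:15:00',  # Friday
    '2022-03-12 14:30:00', '2022-03-12 23:45:00', '2022-03-12 23:50:00',  # Saturday
])

# day_name() is English by default; the first three letters match strftime('%a') in an English locale
toy = pd.DataFrame({'Hour': times.hour, 'WeekDay': times.day_name().str[:3]})

# Rows are hours, columns are weekdays, cells are accident counts
table = toy.groupby(['Hour', 'WeekDay']).size().unstack('WeekDay', fill_value=0)

# Hour with the most accidents for each weekday
peak_hour = table.idxmax()
print(peak_hour)
```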

incident_df = df_MD.groupby(['Hour', 'WeekDay']).size().reset_index(name='Count')

fig = plt.figure(figsize=(18, 10))
ax = fig.add_subplot(1, 1, 1)

my_colors = {'Mon':'blue', 'Tue':'red', 'Wed':'green', 'Thu':'gray', 'Fri':'purple', 'Sat':'gold', 'Sun':'brown'}

for key, grp in incident_df.groupby('WeekDay'):
    grp.plot(ax = ax, kind = 'line', x = 'Hour', y = 'Count', color = my_colors[key], label = key, marker = '8')

plt.title('Maryland Total Accidents by Hour\n(2016 - 2023)', fontsize = 18)
ax.set_xlabel('Hour (24 Hour Interval)', fontsize = 18)
ax.set_ylabel('Total Accidents', fontsize = 18, labelpad=20)
ax.tick_params(axis='both', labelsize=14, rotation = 0)
ax.set_xticks(np.arange(0, 24, 1))

handles, labels = ax.get_legend_handles_labels()

handles = [handles[1], handles[5], handles[6], handles[4], handles[0], handles[2], handles[3]]
labels = [labels[1], labels[5], labels[6], labels[4], labels[0], labels[2], labels[3]]
plt.legend(handles, labels, loc = 'best', fontsize = 14, ncol = 1)

ax.yaxis.set_major_formatter(FuncFormatter(lambda x, pos:('%1.0f')%(x)))

plt.show()

MD Accidents (Waterfall Diagram)

This waterfall chart illustrates how the monthly count of severe (classified as High and Critical) traffic accidents in Maryland compares to the average monthly number of these accidents. The data spans the years 2016 to 2023.

Above Average Months: Severe accident counts tend to be higher than the monthly average (~3,029) during the spring and late in the year. Months showing a surplus (green bars) are January (3,126), March (3,114), April (3,226), May (3,345), November (3,113), and December (3,243). May experienced the highest number of severe accidents.

Below Average Months: Counts were lower than average in late winter, summer, and early fall. Months showing a deficit (red bars) are February (2,858), June (2,808), July (2,868), August (2,838), September (2,823), and October (2,980). By these counts, June had the lowest number of severe accidents during this period, with September close behind.

Seasonal Trend: The data suggests a trend of increasing severe accidents through the spring months peaking in May, followed by a general decline through summer and early fall, before rising again in November and December. February is notably lower than the months immediately surrounding it.
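The arithmetic behind the waterfall can be reproduced from the monthly counts quoted above. This is a minimal sketch in which the variable names are illustrative; each bar is the month's deviation from the mean, and the running total of those deviations returns to zero after December.

```python
import pandas as pd

# Monthly severe-accident counts as quoted in the text above
monthly = pd.Series({
    'January': 3126, 'February': 2858, 'March': 3114, 'April': 3226,
    'May': 3345, 'June': 2808, 'July': 2868, 'August': 2838,
    'September': 2823, 'October': 2980, 'November': 3113, 'December': 3243,
})

mean_count = monthly.mean()         # monthly average
deviation = monthly - mean_count    # surplus (+) or deficit (-) per month
running_total = deviation.cumsum()  # level the waterfall reaches after each month

surplus_months = deviation[deviation > 0].index.tolist()
print(round(mean_count, 1))
print(surplus_months)
```

Since the deviations are measured against their own mean, they always sum to zero, so the closing total bar of such a chart carries no extra information.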

*** Note: Plotly is causing issues with how the waterfall plot displays; the correct plot is the third one. Converting the plot to an HTML file and to a PNG file did not help: the iframe does not display, and the PNG file only adds another plot, so four are displayed. ***

wf_df = filtered_df_MD.groupby(['MonthName']).agg({'Severity_Level':['count']}).reset_index()
wf_df.columns = ['MonthName','Count']
mean_count = wf_df['Count'].mean()
wf_df['Mean'] = mean_count
wf_df['Deviation'] = wf_df['Count'] - mean_count
wf_df.loc[wf_df.index.max() + 1] = ['Total', wf_df['Count'].sum(), wf_df.Mean.sum(), wf_df.Deviation.sum()]
months = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December','Total']
wf_df.MonthName = pd.Categorical(wf_df.MonthName, categories = months, ordered = True)
wf_df.sort_values('MonthName', inplace=True)
wf_df.reset_index(inplace=True, drop=True)

# Color the closing 'Total' bar by the sign of the summed deviation
total_deviation = wf_df.loc[12, 'Deviation']
if total_deviation > 0:
    end_color = 'black'
elif total_deviation < 0:
    end_color = 'red'
else:
    end_color = 'blue'

fig = go.Figure(go.Waterfall( name = '', orientation = 'v', x = wf_df['MonthName'], textposition = 'outside',
                             measure = ['relative','relative','relative','relative','relative','relative',
                                        'relative','relative','relative','relative','relative','relative','total'],
                              y = wf_df['Deviation'], text = ['{:,.0f}'.format(each) for each in wf_df['Count']],
                              decreasing = {'marker':{'color':'red'}},
                              increasing = {'marker':{'color':'green'}},
                              totals = {'marker':{'color': end_color}},
                              hovertemplate = 'Cumulative Deviation to Date: ' + '%{y:,.0f}' + '<br>' +
                                              'Total Accidents In %{x}: %{text}'
                              )
                )

fig.layout = go.Layout(yaxis=dict(tickformat = '.0f'))

fig.update_xaxes(title_text = 'Months', title_font = {'size':16})

fig.update_yaxes(title_text = 'Total Accidents (Running Total)', title_font = {'size':18}, dtick = 50,
                 zeroline = True)

fig.update_layout(title = dict(text = 'Deviation between Accidents and Monthly Mean (Waterfall Diagram)<br>' +
                               'Accident Severity Level (High and Critical) (2016 - 2023)<br>' +
                               'Surpluses appear in Green, Deficits appear in Red',
                               font = dict(family = 'Arial', size = 16, color = 'black')),
                  template = 'simple_white', title_x = 0.5, showlegend = False, autosize = True,
                  margin = dict(l = 30, r = 30, t = 90, b = 30), height = 600, width = 800)


#fig.show()

None

Conclusion

Traffic accidents in Maryland, especially severe ones, exhibit clear patterns related to location, time of day, day of week, and season. The highest risk areas are major highways and the Baltimore-Washington corridor. The highest risk times are weekday afternoon commutes (especially Fridays) and specific months (May, Nov, Dec for severe incidents). While most accidents are moderate, the substantial number of severe incidents, often occurring in clear weather and concentrated in specific areas and times, highlights ongoing traffic safety challenges. Understanding these patterns is crucial for targeted safety initiatives and driver awareness, particularly for residents navigating high-risk areas.

Data Visualization - Python

Jose Rodriguez-Justiniano

2025-04-05