This project examines country-wide car accident data for the contiguous United States from 2016 to 2020. Analyzing this data makes it possible to identify trends that might indicate causes or predictors of car accidents and their severity. Determining which factors influence the likelihood and severity of an accident, and which types of locations might be hotspots for accidents, may allow steps to be taken to reduce accident risk. This information may be useful for creating or revising traffic laws, or for predicting the probability of an accident at a given location at a specific time of day or year.
Source: https://www.kaggle.com/sobhanmoosavi/us-accidents/data
This data set was created using APIs that broadcast traffic data collected from numerous sources including U.S. and state departments of transportation, law enforcement agencies, traffic cameras, and traffic sensors within the road networks. The data includes 4,232,541 accident records from the contiguous U.S., each of which is represented by a row in the data set. The data set also has 49 columns to identify different attributes and traits of each accident. The columns that were used for this analysis include information about the date and time of the accident, the location of the accident (latitude and longitude), and the severity of the accident.
These plots show us that there is a long-tail effect in the total number of accidents per state from 2016 to 2020. The top six states with the most accidents seem to have had significantly more accidents within this time period than the other states in the data set. California, the number one state, in particular had drastically more accidents than the other states with 972,585 accidents while the mean number of accidents for all 49 states is only 86,378.39. This significant skew in accidents per state is especially apparent when comparing the mean of 86,378.39 for all 49 states to the much higher mean of 279,890.5 for the top 10 states.
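The skew described above can also be checked numerically. The following is a minimal sketch on a small hypothetical table (the state labels and counts are made up for illustration and the helper name `skew_summary` is not part of the original analysis); it compares the overall mean with the median and with the mean of the top rows.

```python
import pandas as pd

# Hypothetical per-state counts (illustrative values, not taken from the data set).
toy_counts = pd.DataFrame({
    "States": ["A", "B", "C", "D", "E", "F"],
    "AccidentCount": [900, 300, 120, 60, 15, 5],
})

def skew_summary(counts, top_n):
    """Return the overall mean, the overall median, and the mean of the top_n largest counts."""
    overall_mean = counts["AccidentCount"].mean()
    overall_median = counts["AccidentCount"].median()
    top_mean = counts["AccidentCount"].nlargest(top_n).mean()
    return overall_mean, overall_median, top_mean

overall_mean, overall_median, top_mean = skew_summary(toy_counts, top_n=2)
print(overall_mean, overall_median, top_mean)
```

A mean well above the median, together with a top-N mean far above the overall mean, is the usual signature of a long right tail.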
import os
os.environ['QT_QPA_PLATFORM_PLUGIN_PATH'] = 'c:/ProgramData/Anaconda3/Library/plugins/platforms'
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import folium
path = "U:/"
filename = path + 'US_Accidents_Dec20.csv'
# Load only the columns used in this analysis; reading the full 49-column file first is unnecessary.
df = pd.read_csv(filename, usecols = ['ID','State', 'Start_Lat', 'Start_Lng', 'Severity', 'Start_Time', 'Temperature(F)', 'Weather_Condition', 'Visibility(mi)'])
df['Start_Time'] = pd.to_datetime(df['Start_Time'], format = '%Y-%m-%d %H:%M:%S')
df['Year'] = df['Start_Time'].dt.year
df['Month'] = df['Start_Time'].dt.month
df['Day'] = df['Start_Time'].dt.day
df['Hour'] = df['Start_Time'].dt.hour
df['DayOfTheWeek'] = df['Start_Time'].dt.dayofweek
df['WeekDay'] = df['Start_Time'].dt.strftime('%A')
df['MonthName'] = df['Start_Time'].dt.strftime('%B')
df['DayOfTheYear'] = df['Start_Time'].dt.dayofyear
x = df.groupby(['State']).agg({'State':['count'], 'Severity':['mean']}).reset_index()
x.columns = ['States', 'AccidentCount', 'AvgSeverity']
x = x.sort_values('AccidentCount', ascending = False)
x.reset_index(inplace = True, drop=True)
def pick_colors_according_to_mean_count(this_data):
    colors=[]
    avg = this_data.AccidentCount.mean()
    for each in this_data.AccidentCount:
        if each > avg*1.05:
            colors.append('midnightblue')
        elif each <avg*0.95:
            colors.append('steelblue')
        else:
            colors.append('turquoise')
    return colors
import matplotlib.patches as mpatches
from matplotlib.ticker import FuncFormatter
bottom1 = 0
top1 = 49
d1 = x.loc[bottom1:top1]
my_colors1 = pick_colors_according_to_mean_count(d1)
mean_lab1 = "{:,}".format(round(d1.AccidentCount.mean(),2))
bottom2 = 0
top2 = 9
d2 = x.loc[bottom2:top2]
d2 = d2.sort_values('AccidentCount', ascending=True)
d2.reset_index(inplace=True, drop=True)
my_colors2 = pick_colors_according_to_mean_count(d2)
mean_lab2 = "{:,}".format(round(d2.AccidentCount.mean(),2))
Above = mpatches.Patch(color='midnightblue', label='Above Average')
At = mpatches.Patch(color='turquoise', label='Within 5% of the Average')
Below = mpatches.Patch(color='steelblue', label='Below Average')
fig1 = plt.figure(figsize=(18, 16))
fig1.suptitle('Frequency of Accidents Analysis by State:\n All ' + str(top1) + ' and Top ' + str(top2),
fontsize= 18, fontweight='bold')
ax1 = fig1.add_subplot(2, 1, 1)
ax1.bar(d1.States, d1.AccidentCount, label='Accident Count', color = my_colors1)
ax1.legend(handles=[Above, At, Below], fontsize=14)
plt.axhline(d1.AccidentCount.mean(), color = 'black', linestyle = 'dashed')
ax1.spines['right'].set_visible(False)
ax1.spines['top'].set_visible(False)
ax1.set_title('All '+ str(top1) +' States', size=20)
ax1.set_xlabel('State', fontsize = 16)
ax1.set_ylabel('Accident Count (Millions)', fontsize = 16)
ax1.text(top1-10, d1.AccidentCount.mean()+20000, 'Mean = ' + str(mean_lab1), rotation = 0, fontsize = 14)
ax1.yaxis.set_major_formatter(FuncFormatter(lambda x, pos:('%1.1fM')%(x*1e-6)))
ax2 = fig1.add_subplot(2, 1, 2)
ax2.barh(d2.States, d2.AccidentCount, color = my_colors2)
for row_counter, value_at_row_counter in enumerate(d2.AccidentCount):
    if value_at_row_counter > d2.AccidentCount.mean()*1.05:
        color = 'midnightblue'
    elif value_at_row_counter < d2.AccidentCount.mean()*0.95:
        color = 'steelblue'
    else:
        color = 'turquoise'
    ax2.text(value_at_row_counter+5000, row_counter, str("{:,}".format(value_at_row_counter)), color = color, size=12, fontweight='bold',
             ha='left', va='center', backgroundcolor = 'white')
plt.xlim(0, d2.AccidentCount.max()*1.1) # increasing width of fig
ax2.legend(loc = 'lower right', handles = [Above, At, Below], fontsize = 14)
plt.axvline(d2.AccidentCount.mean(), color = 'black', linestyle = 'dashed')
ax2.text(d2.AccidentCount.mean()+3000, 0, 'Mean = ' + str(mean_lab2), rotation=0, fontsize=14)
ax2.set_title('Top ' + str(top2+1) + ' States', size = 20)
ax2.set_xlabel('Accident Count (Millions)', fontsize = 16)
ax2.set_ylabel('State', fontsize = 16)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
ax2.xaxis.set_major_formatter(FuncFormatter(lambda x, pos:('%1.1fM')%(x*1e-6)))
fig1.subplots_adjust(hspace=0.35)
plt.show()
The dual axis plot below indicates that, at least for the top 10 states, average accident severity is consistently between about 2 and 2.5 on a scale with 1 being the least severe and 4 being the most severe. This is a rather consistent level of average severity despite the large differences in the number of accidents per state between these states. To emphasize the variation in accidents per state, it is worth noting that the number 1 state, California, has an accident count of 972.6K while the number 10 state, Oregon, has an accident count of only 108.4K.
def autolabel(these_bars, this_axis, place_of_decimals, symbol, symbol2):
    for each_bar in these_bars:
        height = each_bar.get_height()
        this_axis.text(each_bar.get_x()+each_bar.get_width()/2, height*1.01, symbol+format(height, place_of_decimals)+symbol2,
                       fontsize=11, color='black', ha='center', va='bottom')
fig = plt.figure(figsize=(18,10))
ax1 = fig.add_subplot(1, 1, 1)
ax2 = ax1.twinx()
bar_width = 0.4
x_pos = np.arange(10)
count_bars = ax1.bar(x_pos-(0.5*bar_width), (d2.AccidentCount)/1000, bar_width, color = 'steelblue', edgecolor='black', label='Accident Count')
aver_severity_bars = ax2.bar(x_pos+(0.5*bar_width), d2.AvgSeverity, bar_width, color = 'midnightblue', edgecolor='black', label='Average Severity')
ax1.set_xlabel('State', fontsize=18)
ax1.set_ylabel('Count of Accidents (Thousands)', fontsize=18, labelpad=20)
ax2.set_ylabel('Average Severity', fontsize =18, rotation=270, labelpad=20)
ax1.tick_params(axis='y', labelsize=14)
ax2.tick_params(axis='y', labelsize=14)
plt.title('Accident Count and Average Severity Analysis:\n States with Top 10 Highest Accident Counts', fontsize=18)
ax1.set_xticks(x_pos)
ax1.set_xticklabels(d2.States, fontsize=14)
count_color, count_label = ax1.get_legend_handles_labels()
severity_color, severity_label = ax2.get_legend_handles_labels()
legend = ax1.legend(count_color + severity_color, count_label + severity_label, loc='upper left', frameon=True, ncol=1, shadow=True,
borderpad=1, fontsize=14)
ax2.set_ylim(0, (d2.AvgSeverity.max())*1.5)
autolabel(count_bars, ax1, '.1f', '', 'k')
autolabel(aver_severity_bars, ax2, '.2f', '', '')
plt.show()
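The visual impression from the dual axis plot can be complemented with a single number. The sketch below is an assumption about how one might do that, not part of the original analysis: it computes a rank correlation (Pearson correlation of the ranks, which is equivalent to Spearman's coefficient when there are no ties) between accident count and average severity. The four rows are hypothetical values chosen only to make the example runnable.

```python
import pandas as pd

# Hypothetical per-state summary (illustrative values, not taken from the data set).
toy_states = pd.DataFrame({
    "States": ["A", "B", "C", "D"],
    "AccidentCount": [900, 300, 120, 60],
    "AvgSeverity": [2.3, 2.4, 2.2, 2.5],
})

# Rank both columns first so that one very large state cannot dominate the result,
# then take the ordinary (Pearson) correlation of the ranks.
count_ranks = toy_states["AccidentCount"].rank()
severity_ranks = toy_states["AvgSeverity"].rank()
rank_corr = count_ranks.corr(severity_ranks)
print(round(rank_corr, 3))
```

Applied to the real per-state table, a coefficient near zero would support the reading that severity does not track accident count, while a value near 1 or -1 would contradict it. With only ten states, any such coefficient should be treated as a rough indication.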
This plot illustrates two spikes in the number of accidents at the times that people commute, as well as a weekend effect. The most significant spike in accidents occurs on weekdays from 7:00 AM to 8:00 AM. The second and slightly less dramatic spike occurs on weekdays from 4:00 PM to 5:00 PM. These peaks suggest that the higher traffic volume during rush hour to and from work contributes to higher numbers of accidents. The higher number of accidents during the morning rush hour could mean that people tend to rush to work, increasing the risk of an accident, but tend not to rush as much to get home. This trend is not present on the weekends, and overall there are far fewer accidents on Saturdays and Sundays. This is likely due to a reduction in traffic when fewer people need to drive to work at the same time.
accident_day_df = df.groupby(['Hour', 'WeekDay'])['ID'].count().reset_index(name='TotalAccidents')
from matplotlib.ticker import FuncFormatter
fig = plt.figure(figsize = (18, 10))
ax = fig.add_subplot(1, 1, 1)
my_colors = {'Monday':'blue',
'Tuesday':'red',
'Wednesday':'green',
'Thursday':'pink',
'Friday':'orange',
'Saturday':'gray',
'Sunday':'brown'}
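# Group by the column name (not a one-element list) so each key is a plain string
# that can be looked up in my_colors; recent pandas versions return tuple keys for list groupers.
for key, grp in accident_day_df.groupby('WeekDay'):
    grp.plot(ax=ax, kind='line', x='Hour', y='TotalAccidents', color=my_colors[key], label=key, marker='8')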
plt.title('Total Accidents by Hour by Day of Week', fontsize=18)
ax.set_xlabel('Hour (24 Hour Interval)', fontsize=18)
ax.set_ylabel('Total Accidents', fontsize=18, labelpad=20)
ax.tick_params(axis='x', labelsize=14, rotation=0)
ax.tick_params(axis='y', labelsize=14, rotation=0)
ax.set_xticks(np.arange(24))
handles, labels = ax.get_legend_handles_labels()
handles = [handles[1],handles[5],handles[6],handles[4],handles[0],handles[2],handles[3]]
labels = [labels[1], labels[5],labels[6],labels[4], labels[0], labels[2], labels[3]]
plt.legend(handles, labels, loc='best', fontsize=14, ncol=1)
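# Format tick values with thousands separators (the trailing "% (x)" was a leftover with no format specifier).
ax.yaxis.set_major_formatter(FuncFormatter(lambda x, pos: format(int(x), ',')))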
plt.show()
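The peak hours read off the line plot can also be extracted directly from a table shaped like `accident_day_df`. The sketch below uses a small hypothetical frame with the same column names (`Hour`, `WeekDay`, `TotalAccidents`); the values are invented for illustration.

```python
import pandas as pd

# Hypothetical hourly totals (illustrative values, not taken from the data set).
toy_hourly = pd.DataFrame({
    "Hour": [7, 8, 17, 7, 8, 17, 13, 14],
    "WeekDay": ["Monday"] * 3 + ["Tuesday"] * 3 + ["Saturday"] * 2,
    "TotalAccidents": [50, 80, 70, 55, 90, 60, 30, 35],
})

# idxmax returns, for each weekday, the row label of the largest TotalAccidents value;
# selecting those rows gives the busiest hour per weekday.
peak_rows = toy_hourly.loc[toy_hourly.groupby("WeekDay")["TotalAccidents"].idxmax()]
peak_hour = peak_rows.set_index("WeekDay")["Hour"]
print(peak_hour)
```

The same two lines applied to the real table would list the busiest hour for each day, which makes the weekday and weekend contrast easy to verify without reading values off the chart.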
From this line plot we are able to see that the number of accidents per year tends to increase every year. We cannot determine what causes this trend from this data but increases in the number of drivers and in the number of distractions (phones, social media, etc.) are potential contributors that may be worth considering. Additionally, there appears to be a slight increase in accidents per month as it gets later in the year. This is an interesting trend worth looking further into but it could potentially be due to travel being low at the beginning of the year and increasing later in the summer and during the holidays. This trend could also be caused by weather conditions changing throughout the year. There also appears to be an interesting COVID effect in 2020. As expected, the number of accidents per month increased relative to previous years at the beginning of 2020 but then drastically decreased below 2016 levels in July and August before rapidly increasing to a level far above numbers for any other year in November and December. This could be due to people driving less when the pandemic became more serious but then driving significantly more, and being out of practice, after several months of dealing with COVID restrictions.
month_df = df.groupby(['Year', 'Month'])['ID'].count().reset_index(name='TotalAccidents')
month_df['Year'] = month_df['Year'].astype(str)
from matplotlib.ticker import FuncFormatter
fig = plt.figure(figsize = (18, 10))
ax = fig.add_subplot(1, 1, 1)
my_colors2 = {'2016':'blue',
'2017':'red',
'2018':'green',
'2019':'gold',
'2020':'brown'}
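# Group by the column name (not a one-element list) so each key is a plain string
# that can be looked up in my_colors2.
for key, grp in month_df.groupby('Year'):
    grp.plot(ax=ax, kind='line', x='Month', y='TotalAccidents', color=my_colors2[key], label=key, marker='8')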
plt.title('Total Accidents by Month by Year', fontsize=18)
ax.set_xlabel('Month', fontsize=18)
ax.set_ylabel('Total Accidents', fontsize=18, labelpad=20)
ax.tick_params(axis='x', labelsize=14, rotation=0)
ax.tick_params(axis='y', labelsize=14, rotation=0)
ax.set_xticks(np.arange(1, 13))  # months run from 1 to 12; np.arange(13) would add an empty tick at 0
ax.yaxis.set_major_formatter(FuncFormatter(lambda x, pos: format(int(x), ',')))
plt.show()
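Year-over-year comparisons like the ones discussed for 2020 are easier to quantify in a wide table with one column per year. The following is a minimal sketch on a hypothetical frame with the same columns as `month_df`; the numbers are invented and the variable names are assumptions.

```python
import pandas as pd

# Hypothetical monthly totals for two years (illustrative values, not taken from the data set).
toy_months = pd.DataFrame({
    "Year": [2019, 2019, 2020, 2020],
    "Month": [1, 2, 1, 2],
    "TotalAccidents": [100, 110, 120, 90],
})

# One row per month, one column per year.
wide = toy_months.pivot(index="Month", columns="Year", values="TotalAccidents")

# Percentage change of each 2020 month relative to the same month in 2019.
change_pct = (wide[2020] - wide[2019]) / wide[2019] * 100
print(change_pct.round(1))
```

On the real data, negative values for the summer months of 2020 and large positive values for November and December would correspond to the dip and rebound visible in the line plot.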
The nested pie chart below reinforces the findings of the previous line plot that the frequency of accidents seems to increase gradually throughout each year. This trend is apparent from the slight increase in the percentage of total accidents accounted for in each quarter. 20.78% of the total accidents occurred in Quarter 1 of their respective years while 34.98% of the total accidents occurred in Quarter 4. By looking at the percentage of total accidents from month to month we see that there is greater deviation from this pattern but the trend is still there. There are increases and decreases in the number of accidents from month to month but there is still an overall upward trend as seen by the difference between 7.13% of accidents occurring in January and 12.33% occurring in December.
df['MonthAbbrev'] = df['Start_Time'].dt.strftime('%b')
df['Quarter'] = 'Quarter' + df.Start_Time.dt.quarter.astype('string')
pie_df = df.groupby(['Quarter', 'MonthAbbrev', 'Month'])['ID'].count().reset_index(name='TotalAccidents')
pie_df.sort_values(by=['Month'], inplace = True)
pie_df.reset_index(inplace = True, drop=True)
del pie_df['Month']
number_outside_colors = len(pie_df.Quarter.unique())
outside_color_ref_number = np.arange(number_outside_colors)*4
number_inside_colors = len(pie_df.MonthAbbrev.unique())
all_color_ref_number = np.arange(number_outside_colors + number_inside_colors)
inside_color_ref_number = []
for each in all_color_ref_number:
    if each not in outside_color_ref_number:
        inside_color_ref_number.append(each)
fig = plt.figure(figsize = (10,10))
ax = fig.add_subplot(1, 1, 1)
colormap = plt.get_cmap("tab20c")
outer_colors = colormap(outside_color_ref_number)
all_accidents = pie_df.TotalAccidents.sum()
pie_df.groupby(['Quarter'])['TotalAccidents'].sum().plot(
kind = 'pie', radius=1, colors = outer_colors, pctdistance = 0.85, labeldistance = 1.1,
wedgeprops = dict(edgecolor='w'), textprops={'fontsize':16},
autopct = lambda p: '{:.2f}%\n({:.1f}K)'.format(p,(p/100)*all_accidents/1e3),
startangle=90)
inner_colors = colormap(inside_color_ref_number)
pie_df.TotalAccidents.plot(
kind = 'pie', radius=0.7, colors = inner_colors, pctdistance = 0.6, labeldistance = 0.8,
wedgeprops = dict(edgecolor='w'), textprops={'fontsize':13},
labels = pie_df.MonthAbbrev,
autopct = '%1.2f%%',
startangle=90)
hole = plt.Circle((0,0), 0.3, fc = 'white')
fig1 = plt.gcf()
fig1.gca().add_artist(hole)
ax.yaxis.set_visible(False)
plt.title('Total Accidents by Quarter and Month', fontsize = 18)
ax.text(0, 0, 'Total Accidents\n' + str(round(all_accidents/1e6,2)) + 'M', size=18, ha='center', va='center')
ax.axis('equal')
plt.tight_layout()
plt.show()
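The quarterly percentages shown in the pie chart can be reproduced without plotting. This sketch assumes a handful of hypothetical timestamps and uses `value_counts(normalize=True)` to get each quarter's share of the total.

```python
import pandas as pd

# Hypothetical accident start times (illustrative values, not taken from the data set).
start_times = pd.to_datetime(pd.Series([
    "2019-01-15 08:00:00",
    "2019-05-02 17:30:00",
    "2019-11-20 07:45:00",
    "2019-12-01 16:10:00",
]))

# Label each record with its calendar quarter, then compute the share of records per quarter.
quarter_label = "Q" + start_times.dt.quarter.astype(str)
quarter_share = quarter_label.value_counts(normalize=True).sort_index() * 100
print(quarter_share)
```

Quarters with no records simply do not appear in the result, so the shares always sum to 100 percent.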
sev_min = df['Severity'].min()
sev_max = df['Severity'].max()
avg_sev = df['Severity'].mean()
print('Minimum Severity = ', sev_min)
## Minimum Severity = 1
print('Maximum Severity = ', sev_max)
## Maximum Severity = 4
print('Average Severity = ', avg_sev)
## Average Severity = 2.3050349659932414
print()
state_min = x['AccidentCount'].min()
state_max = x['AccidentCount'].max()
avg_state = x['AccidentCount'].mean()
avg_state_sev = x['AvgSeverity'].mean()
print('Minimum Accidents per State = ', state_min)
## Minimum Accidents per State = 220
print('Maximum Accidents per State = ', state_max)
## Maximum Accidents per State = 972585
print('Average Accidents per State = ', avg_state)
## Average Accidents per State = 86378.38775510204
print('Average per State Severity = ', avg_state_sev)
## Average per State Severity = 2.3692962775199073
print()
print('Data Types: ')
## Data Types:
df.info()
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 4232541 entries, 0 to 4232540
## Data columns (total 19 columns):
## # Column Dtype
## --- ------ -----
## 0 ID object
## 1 Severity int64
## 2 Start_Time datetime64[ns]
## 3 Start_Lat float64
## 4 Start_Lng float64
## 5 State object
## 6 Temperature(F) float64
## 7 Visibility(mi) float64
## 8 Weather_Condition object
## 9 Year int64
## 10 Month int64
## 11 Day int64
## 12 Hour int64
## 13 DayOfTheWeek int64
## 14 WeekDay object
## 15 MonthName object
## 16 DayOfTheYear int64
## 17 MonthAbbrev object
## 18 Quarter string
## dtypes: datetime64[ns](1), float64(4), int64(7), object(6), string(1)
## memory usage: 613.5+ MB
print('NAs by Attribute: ')
## NAs by Attribute:
df.isna().sum()
## ID 0
## Severity 0
## Start_Time 0
## Start_Lat 0
## Start_Lng 0
## State 0
## Temperature(F) 89900
## Visibility(mi) 98668
## Weather_Condition 98383
## Year 0
## Month 0
## Day 0
## Hour 0
## DayOfTheWeek 0
## WeekDay 0
## MonthName 0
## DayOfTheYear 0
## MonthAbbrev 0
## Quarter 0
## dtype: int64
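Raw missing-value counts are easier to judge as a share of all rows. With 4,232,541 records, the counts above for temperature, visibility, and weather condition each work out to a little over 2 percent. The sketch below shows one way to compute that share directly; the frame is a hypothetical stand-in that reuses three of the column names.

```python
import numpy as np
import pandas as pd

# Hypothetical records with gaps (illustrative values, not taken from the data set).
toy_records = pd.DataFrame({
    "Temperature(F)": [70.0, np.nan, 65.0, np.nan],
    "Visibility(mi)": [10.0, 10.0, np.nan, 5.0],
    "Severity": [2, 3, 2, 4],
})

# isna() yields True/False per cell; the column mean of those flags is the missing fraction.
missing_pct = toy_records.isna().mean().mul(100).sort_values(ascending=False)
print(missing_pct)
```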
This analysis allows for a closer look into accidents in the contiguous U.S. from 2016 to 2020 so that trends and potential determinants of accident risk can be identified. The bar plot for accident counts by state reveals a significant long-tail effect. The top six states, especially California, have dramatically higher accident counts than the rest of the states, which all fall into a relatively low range of accident counts. This is emphasized by the difference in the mean number of accidents per state for all states, 86,378.39, and for the top ten states, 279,890.5. However, the dual axis bar plot suggests that total accident count is not related to average accident severity, at least among the top ten states. The plot shows that average accident severity was nearly constant across the top ten states despite large differences in total accident counts.
The line plots for accidents by hour by day of week and accidents by month by year, as well as the pie chart for accidents by quarter and month, illustrate interesting trends in accidents over time. They show that over the course of each year accidents per month trend upwards, which means that total accidents tend to increase each quarter and each year. The only year in the data set that deviated from this trend was 2020, which seems to have had a COVID effect causing accident counts to drop below 2016 levels in the summer before rising significantly higher than any previous levels in the fall and early winter. Furthermore, there appears to be a rush hour and weekend effect. Accident frequency is at its highest when people are commuting, and possibly rushing, to work in the morning on weekdays. This is closely followed by a slightly lower peak in accident frequency when people are commuting home on weekdays. This trend is not present on weekends, and accident frequency is much lower overall on weekends, possibly due to less traffic from people driving to and from work.
These findings call for additional analysis, but the information gathered so far identifies some specific issues to focus on in the effort to reduce the likelihood of accidents. New or altered traffic laws may be necessary in areas with higher accident counts, such as California. This could include stricter laws with regard to driving while distracted by things like phones. It is also possible that places with high accident counts require greater investment in infrastructure to accommodate and adapt to a greater volume of traffic.
Just as data can be used to answer questions and explain trends, it can also create new questions for consideration. Here are some new questions that could be explored to continue the analysis of U.S. accidents.
Why do states like California have significantly more accidents than other states?
Why did accidents not decrease until several months into the pandemic in 2020?
Why did accidents rapidly increase in the Fall of 2020?
Why is the number of accidents in the Contiguous U.S. consistently rising?
This data set contains a large and fairly complete record of accidents in the contiguous U.S. from 2016 to 2020. It also contains a large number of attributes for each accident, including coordinates for the location and the date and time. Having this information for such a complete data set allowed for a rather in-depth analysis of accidents in the U.S. over time and of how time of day, week, or year can impact the likelihood of an accident. However, data for many of the other attributes was collected in such a way that it would require a great deal of additional work to be usable in an analysis. For instance, weather conditions were not recorded in any standardized fashion, resulting in conditions like rain being described in various ways with different descriptive words. Many accidents were also missing data for attributes like weather conditions and temperature, making it even more challenging to conduct an analysis of the relationship between these variables and accident counts or severity.
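One possible way to make the free-text weather descriptions usable is to collapse them into a few coarse categories with keyword matching. The sketch below is only an illustration of that idea: the keyword list, the category names, the helper `coarse_weather`, and the sample strings are all assumptions, and a real mapping would need to be built from the values that actually occur in `Weather_Condition`.

```python
import pandas as pd

# Hypothetical keyword-to-category rules, checked in order (first match wins).
KEYWORD_TO_CATEGORY = [
    ("thunder", "Storm"),
    ("snow", "Snow"),
    ("rain", "Rain"),
    ("drizzle", "Rain"),
    ("fog", "Fog"),
    ("cloud", "Cloudy"),
    ("overcast", "Cloudy"),
    ("clear", "Clear"),
    ("fair", "Clear"),
]

def coarse_weather(raw):
    """Map a free-text weather description to a coarse category."""
    if pd.isna(raw):
        return "Unknown"  # keep missing values visible instead of dropping them
    text = str(raw).lower()
    for keyword, category in KEYWORD_TO_CATEGORY:
        if keyword in text:
            return category
    return "Other"

# Illustrative descriptions, including a missing value.
raw_conditions = pd.Series(
    ["Light Rain", "Heavy Rain", "Mostly Cloudy", None, "Thunderstorms and Rain"]
)
coarse_conditions = raw_conditions.map(coarse_weather)
print(coarse_conditions.tolist())
```

Because the rules are checked in order, a description that mentions both thunder and rain lands in the storm category; reordering the list changes that choice, so the order itself is a modeling decision.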
Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, and Rajiv Ramnath. “A Countrywide Traffic Accident Dataset.” 2019.
Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, Radu Teodorescu, and Rajiv Ramnath. “Accident Risk Prediction Based on Heterogeneous Sparse Data: New Dataset and Insights.” In Proceedings of the 27th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems. ACM, 2019.