Analysis of U.S. Accidents (2016-2020)

Introduction

This project examines country-wide car accident data for the contiguous United States from 2016 to 2020. Analyzing car accident data makes it possible to identify trends that might indicate causes or predictors of car accidents and their severity. Determining which factors influence the likelihood and severity of a car accident, and which types of locations might be hotspots for accidents, may allow steps to be taken to reduce the risk of accidents. This information may be useful for creating or altering traffic laws, or for predicting the probability of an accident at a given location at a specific time of day or year.

Source: https://www.kaggle.com/sobhanmoosavi/us-accidents/data

Dataset

This data set was created using APIs that broadcast traffic data collected from numerous sources including U.S. and state departments of transportation, law enforcement agencies, traffic cameras, and traffic sensors within the road networks. The data includes 4,232,541 accident records from the contiguous U.S., each of which is represented by a row in the data set. The data set also has 49 columns to identify different attributes and traits of each accident. The columns that were used for this analysis include information about the date and time of the accident, the location of the accident (latitude and longitude), and the severity of the accident.

Findings

Accidents by State

These plots show us that there is a long-tail effect in the total number of accidents per state from 2016 to 2020. The top six states with the most accidents seem to have had significantly more accidents within this time period than the other states in the data set. California, the number one state, in particular had drastically more accidents than the other states with 972,585 accidents while the mean number of accidents for all 49 states is only 86,378.39. This significant skew in accidents per state is especially apparent when comparing the mean of 86,378.39 for all 49 states to the much higher mean of 279,890.5 for the top 10 states.

Key Takeaway: The top states with the highest number of total accidents have significantly more accidents than the majority of the states in the data set. This may be due to factors including population size, population density, or traffic laws and it calls for a closer look into potential causes.
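
As a quick arithmetic check on the skew, the reported means can be combined with the total record count to estimate how much of the data the top 10 states account for. This is a minimal sketch that uses only figures quoted in this report; the variable names are illustrative and are not part of the analysis code.

```python
# Figures quoted in this report
total_accidents = 4_232_541   # rows in the data set
n_states = 49                 # states represented in the data set
top10_mean = 279_890.5        # mean accident count for the top 10 states

# Mean across all states, recomputed from the total
overall_mean = total_accidents / n_states

# Total and share of accidents that fall in the top 10 states
top10_total = top10_mean * 10
top10_share = top10_total / total_accidents

# Mean for the states outside the top 10
rest_mean = (total_accidents - top10_total) / (n_states - 10)

print(f"Overall mean per state: {overall_mean:,.2f}")
print(f"Top 10 share of all accidents: {top10_share:.1%}")
print(f"Mean for the remaining states: {rest_mean:,.2f}")
```

Under these figures the top 10 states hold about two thirds of all records, which is why the overall mean sits far above the typical state.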


import os
os.environ['QT_QPA_PLATFORM_PLUGIN_PATH'] = 'c:/ProgramData/Anaconda3/Library/plugins/platforms'

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import folium

path = "U:/"
filename = path + 'US_Accidents_Dec20.csv'

# Load only the columns used in this analysis instead of reading all 49 columns first
df = pd.read_csv(filename, usecols = ['ID','State', 'Start_Lat', 'Start_Lng', 'Severity', 'Start_Time', 'Temperature(F)', 'Weather_Condition', 'Visibility(mi)'])

df['Start_Time'] = pd.to_datetime(df['Start_Time'], format = '%Y-%m-%d %H:%M:%S')

df['Year'] = df['Start_Time'].dt.year
df['Month'] = df['Start_Time'].dt.month
df['Day'] = df['Start_Time'].dt.day
df['Hour'] = df['Start_Time'].dt.hour
df['DayOfTheWeek'] = df['Start_Time'].dt.dayofweek
df['WeekDay'] = df['Start_Time'].dt.strftime('%A')
df['MonthName'] = df['Start_Time'].dt.strftime('%B')
df['DayOfTheYear'] = df['Start_Time'].dt.dayofyear

x = df.groupby(['State']).agg({'State':['count'], 'Severity':['mean']}).reset_index()
x.columns = ['States', 'AccidentCount', 'AvgSeverity']

x = x.sort_values('AccidentCount', ascending = False)
x.reset_index(inplace = True, drop=True)

def pick_colors_according_to_mean_count(this_data):
    colors=[]
    avg = this_data.AccidentCount.mean()
    for each in this_data.AccidentCount:
        if each > avg*1.05:
            colors.append('midnightblue')
        elif each <avg*0.95:
            colors.append('steelblue')
        else:
            colors.append('turquoise')
    return colors
    
import matplotlib.patches as mpatches
from matplotlib.ticker import FuncFormatter

bottom1 = 0
top1 = 49
d1 = x.loc[bottom1:top1]
my_colors1 = pick_colors_according_to_mean_count(d1)
mean_lab1 = "{:,}".format(round(d1.AccidentCount.mean(),2))

bottom2 = 0
top2 = 9
d2 = x.loc[bottom2:top2]
d2 = d2.sort_values('AccidentCount', ascending=True)
d2.reset_index(inplace=True, drop=True)
my_colors2 = pick_colors_according_to_mean_count(d2)
mean_lab2 = "{:,}".format(round(d2.AccidentCount.mean(),2))

Above = mpatches.Patch(color='midnightblue', label='Above Average')
At = mpatches.Patch(color='turquoise', label='Within 5% of the Average')
Below = mpatches.Patch(color='steelblue', label='Below Average')

fig1 = plt.figure(figsize=(18, 16))
fig1.suptitle('Frequency of Accidents Analysis by State:\n All ' + str(top1) + ' and Top ' + str(top2+1),
              fontsize= 18, fontweight='bold')

ax1 = fig1.add_subplot(2, 1, 1)
ax1.bar(d1.States, d1.AccidentCount, label='Accident Count', color = my_colors1)

ax1.legend(handles=[Above, At, Below], fontsize=14)
plt.axhline(d1.AccidentCount.mean(), color = 'black', linestyle = 'dashed')
ax1.spines['right'].set_visible(False)
ax1.spines['top'].set_visible(False)
ax1.set_title('All '+ str(top1) +' States', size=20)
ax1.set_xlabel('State', fontsize = 16)
ax1.set_ylabel('Accident Count (Millions)', fontsize = 16)
ax1.text(top1-10, d1.AccidentCount.mean()+20000, 'Mean = ' + str(mean_lab1), rotation = 0, fontsize = 14)
ax1.yaxis.set_major_formatter(FuncFormatter(lambda x, pos:('%1.1fM')%(x*1e-6)))

ax2 = fig1.add_subplot(2, 1, 2)
ax2.barh(d2.States, d2.AccidentCount, color = my_colors2)

for row_counter, value_at_row_counter in enumerate(d2.AccidentCount):
    if value_at_row_counter > d2.AccidentCount.mean()*1.05:
        color = 'midnightblue'
    elif value_at_row_counter < d2.AccidentCount.mean()*0.95:
        color = 'steelblue'
    else:
        color = 'turquoise'
    ax2.text(value_at_row_counter+5000, row_counter, str("{:,}".format(value_at_row_counter)), color = color, size=12, fontweight='bold',
            ha='left', va='center', backgroundcolor = 'white')
plt.xlim(0, d2.AccidentCount.max()*1.1) # increasing width of fig

ax2.legend(loc = 'lower right', handles = [Above, At, Below], fontsize = 14)

plt.axvline(d2.AccidentCount.mean(), color = 'black', linestyle = 'dashed')
ax2.text(d2.AccidentCount.mean()+3000, 0, 'Mean = ' + str(mean_lab2), rotation=0, fontsize=14)

ax2.set_title('Top ' + str(top2+1) + ' States', size = 20)
ax2.set_xlabel('Accident Count (Millions)', fontsize = 16)
ax2.set_ylabel('State', fontsize = 16)
plt.xticks(fontsize=14)

plt.yticks(fontsize=14)

ax2.xaxis.set_major_formatter(FuncFormatter(lambda x, pos:('%1.1fM')%(x*1e-6)))

fig1.subplots_adjust(hspace=0.35)

plt.show()

Accident Count and Average Severity

The dual axis plot below indicates that, at least for the top 10 states, average accident severity is consistently between about 2 and 2.5 on a scale with 1 being the least severe and 4 being the most severe. This is a rather consistent level of average severity despite the large differences in the number of accidents per state between these states. To emphasize the variation in accidents per state, it is worth noting that the number 1 state, California, has an accident count of 972.6K while the number 10 state, Oregon, has an accident count of only 108.4K.

Key Takeaway: The frequency with which accidents occur in a state does not seem to be correlated with the average severity of accidents in that state. This could mean that the factors that determine the frequency and likelihood of accidents are different from the factors that influence the severity of accidents.
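
The takeaway above is based on reading the plot. One way to put a number on it would be a rank correlation between accident count and average severity. The sketch below is an assumption about how that could be done, not part of the original analysis: the helper `rank_correlation` and the toy values are hypothetical, and the correlation is computed as the Pearson correlation of the ranks (a Spearman-style measure).

```python
import pandas as pd

# Hypothetical per-state summary; these numbers are illustrative, not from the data set
states = pd.DataFrame({
    "States": ["A", "B", "C", "D", "E"],
    "AccidentCount": [900_000, 300_000, 250_000, 150_000, 100_000],
    "AvgSeverity": [2.3, 2.5, 2.2, 2.4, 2.3],
})

def rank_correlation(frame, col_a, col_b):
    """Spearman-style correlation: Pearson correlation of the ranks."""
    return frame[col_a].rank().corr(frame[col_b].rank())

rho = rank_correlation(states, "AccidentCount", "AvgSeverity")
print(f"Rank correlation between count and severity: {rho:.2f}")
```

Applied to the per-state table built earlier, the call would be `rank_correlation(x, 'AccidentCount', 'AvgSeverity')`. A value near 0 would support the takeaway, though with only a handful of states any estimate is noisy.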


def autolabel(these_bars, this_axis, place_of_decimals, symbol, symbol2):
    for each_bar in these_bars:
        height = each_bar.get_height()
        this_axis.text(each_bar.get_x()+each_bar.get_width()/2, height*1.01, symbol+format(height, place_of_decimals)+symbol2,
                    fontsize=11, color='black', ha='center', va='bottom')
                    
fig = plt.figure(figsize=(18,10))
ax1 = fig.add_subplot(1, 1, 1)
ax2 = ax1.twinx()
bar_width = 0.4

x_pos = np.arange(10)
count_bars = ax1.bar(x_pos-(0.5*bar_width), (d2.AccidentCount)/1000, bar_width, color = 'steelblue', edgecolor='black', label='Accident Count')

aver_severity_bars = ax2.bar(x_pos+(0.5*bar_width), d2.AvgSeverity, bar_width, color = 'midnightblue', edgecolor='black', label='Average Severity')

ax1.set_xlabel('State', fontsize=18)
ax1.set_ylabel('Count of Accidents (Thousands)', fontsize=18, labelpad=20)
ax2.set_ylabel('Average Severity', fontsize =18, rotation=270, labelpad=20)
ax1.tick_params(axis='y', labelsize=14)
ax2.tick_params(axis='y', labelsize=14)

plt.title('Accident Count and Average Severity Analysis:\n States with Top 10 Highest Accident Counts', fontsize=18)
ax1.set_xticks(x_pos)

ax1.set_xticklabels(d2.States, fontsize=14)

count_color, count_label = ax1.get_legend_handles_labels()
severity_color, severity_label = ax2.get_legend_handles_labels()
legend = ax1.legend(count_color + severity_color, count_label + severity_label, loc='upper left', frameon=True, ncol=1, shadow=True,
                   borderpad=1, fontsize=14)

ax2.set_ylim(0, (d2.AvgSeverity.max())*1.5)

autolabel(count_bars, ax1, '.1f', '', 'k')
autolabel(aver_severity_bars, ax2, '.2f', '', '')

plt.show()

Accidents by Hour and Day of Week

This plot illustrates two spikes in the number of accidents at the times that people commute, as well as a weekend effect. The most significant spike occurs on weekdays from 7:00 AM to 8:00 AM. The second, slightly less dramatic spike occurs on weekdays from 4:00 PM to 5:00 PM. These peaks suggest that heavier traffic during rush hour to and from work contributes to higher numbers of accidents. The higher number of accidents during the morning rush hour could mean that people tend to rush to work, increasing the risk of an accident, but rush less on the way home. This pattern is not present on weekends, and overall there are far fewer accidents on Saturdays and Sundays. This is likely due to reduced traffic when fewer people need to drive to work at the same time.

Key Takeaway: Accidents seem to be most likely to occur on weekdays during the hours that people commute to and from work. By comparison, there are far fewer accidents overall on the weekends when fewer people need to commute.
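
The peak hours described above were read off the plot. They could also be extracted directly from a table shaped like `accident_day_df`. The following is a minimal sketch under that assumption; the toy counts are hypothetical and only mimic the column layout.

```python
import pandas as pd

# Hypothetical hourly counts in the same shape as accident_day_df
toy = pd.DataFrame({
    "Hour":    [7, 8, 16, 17, 7, 8, 16, 17],
    "WeekDay": ["Monday"] * 4 + ["Saturday"] * 4,
    "TotalAccidents": [500, 620, 480, 560, 90, 110, 150, 140],
})

# idxmax returns the row label of the busiest hour within each day;
# .loc then pulls those rows out of the table
peak_rows = toy.loc[toy.groupby("WeekDay")["TotalAccidents"].idxmax()]

# Map each day to its busiest hour
peaks = peak_rows.set_index("WeekDay")["Hour"].to_dict()
print(peaks)
```

This only reports the single busiest hour per day, so a secondary afternoon peak would need a separate look, for example by restricting the table to hours after noon first.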


accident_day_df = df.groupby(['Hour', 'WeekDay'])['ID'].count().reset_index(name='TotalAccidents')

from matplotlib.ticker import FuncFormatter

fig = plt.figure(figsize = (18, 10))
ax = fig.add_subplot(1, 1, 1)

my_colors = {'Monday':'blue',
             'Tuesday':'red',
             'Wednesday':'green',
             'Thursday':'pink',
             'Friday':'orange',
             'Saturday':'gray',
             'Sunday':'brown'}

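# Group by a plain column name so each key is a string such as 'Monday';
# grouping by a one-element list yields tuple keys in recent pandas versions
for key, grp in accident_day_df.groupby('WeekDay'):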
    grp.plot(ax=ax, kind='line', x='Hour', y='TotalAccidents', color=my_colors[key], label=key, marker='8')

plt.title('Total Accidents by Hour by Day of Week', fontsize=18)
ax.set_xlabel('Hour (24 Hour Interval)', fontsize=18)
ax.set_ylabel('Total Accidents', fontsize=18, labelpad=20)
ax.tick_params(axis='x', labelsize=14, rotation=0)
ax.tick_params(axis='y', labelsize=14, rotation=0)

ax.set_xticks(np.arange(24))

handles, labels = ax.get_legend_handles_labels()
handles = [handles[1],handles[5],handles[6],handles[4],handles[0],handles[2],handles[3]]
labels  = [labels[1], labels[5],labels[6],labels[4], labels[0], labels[2], labels[3]]
plt.legend(handles, labels, loc='best', fontsize=14, ncol=1)

# Format y-axis ticks with thousands separators
ax.yaxis.set_major_formatter(FuncFormatter(lambda value, pos: format(int(value), ',')))

plt.show()

Accidents by Month by Year

This line plot shows that the number of accidents tends to increase from year to year. The data cannot tell us what causes this trend, but growth in the number of drivers and in the number of distractions (phones, social media, etc.) are potential contributors worth considering. Additionally, accidents per month appear to rise slightly as the year goes on. This trend is worth investigating further; it may reflect travel being low at the beginning of the year and increasing during the summer and the holidays, or weather conditions changing throughout the year. There also appears to be a COVID effect in 2020. In line with the year-over-year trend, the number of accidents per month at the beginning of 2020 was higher than in previous years, but it then dropped drastically, falling below 2016 levels in July and August, before rapidly increasing to a level far above any other year in November and December. This could be due to people driving less when the pandemic became more serious and then driving significantly more, while out of practice, after several months of COVID restrictions.

Key Takeaway: Accidents per month tend to steadily increase every year and accidents per month seem to slowly increase as the year goes on. There appears to have been a COVID effect causing an unusual drop in accidents per month followed by a dramatic spike.
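
To make a comparison such as "2020 versus the same month in 2019" explicit, the monthly totals can be pivoted so that each year becomes a column. This is a sketch under assumed inputs: the toy numbers are hypothetical, and `Year` is kept as an integer here, whereas the analysis code converts it to a string.

```python
import pandas as pd

# Hypothetical monthly totals in the same shape as month_df
toy = pd.DataFrame({
    "Year":  [2019, 2019, 2019, 2020, 2020, 2020],
    "Month": [1, 7, 12, 1, 7, 12],
    "TotalAccidents": [1000, 1100, 1200, 1100, 550, 1800],
})

# One row per month, one column per year
wide = toy.pivot(index="Month", columns="Year", values="TotalAccidents")

# Percent change of 2020 relative to the same month in 2019
yoy_2020 = (wide[2020] / wide[2019] - 1) * 100
print(yoy_2020.round(1))
```

Comparing the same calendar month across years keeps the seasonal pattern from being mistaken for a year-specific effect.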


month_df = df.groupby(['Year', 'Month'])['ID'].count().reset_index(name='TotalAccidents')
month_df['Year'] = month_df['Year'].astype(str)

from matplotlib.ticker import FuncFormatter

fig = plt.figure(figsize = (18, 10))
ax = fig.add_subplot(1, 1, 1)

my_colors2 = {'2016':'blue',
              '2017':'red',
              '2018':'green',
              '2019':'gold',
              '2020':'brown'}

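# Group by a plain column name so each key is a string such as '2016';
# grouping by a one-element list yields tuple keys in recent pandas versions
for key, grp in month_df.groupby('Year'):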
    grp.plot(ax=ax, kind='line', x='Month', y='TotalAccidents', color=my_colors2[key], label=key, marker='8')

plt.title('Total Accidents by Month by Year', fontsize=18)
ax.set_xlabel('Month', fontsize=18)
ax.set_ylabel('Total Accidents', fontsize=18, labelpad=20)
ax.tick_params(axis='x', labelsize=14, rotation=0)
ax.tick_params(axis='y', labelsize=14, rotation=0)

# Months run from 1 to 12, so start the ticks at 1 rather than 0
ax.set_xticks(np.arange(1, 13))

# Format y-axis ticks with thousands separators
ax.yaxis.set_major_formatter(FuncFormatter(lambda value, pos: format(int(value), ',')))

plt.show()

Quarterly and Monthly Analysis

The nested pie chart below reinforces the findings of the previous line plot that the frequency of accidents seems to increase gradually throughout each year. This trend is apparent from the slight increase in the percentage of total accidents accounted for in each quarter. 20.78% of the total accidents occurred in Quarter 1 of their respective years while 34.98% of the total accidents occurred in Quarter 4. By looking at the percentage of total accidents from month to month we see that there is greater deviation from this pattern but the trend is still there. There are increases and decreases in the number of accidents from month to month but there is still an overall upward trend as seen by the difference between 7.13% of accidents occurring in January and 12.33% occurring in December.

Key Takeaway: The pie chart provides further evidence that the frequency of accidents increases from quarter to quarter each year. This follows the trend of total accidents per year increasing each year as seen in the line plot.
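
The quarterly percentages shown in the chart are each quarter's share of all records. A minimal sketch of that calculation follows, assuming a datetime column like `Start_Time`; the timestamps here are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import pandas as pd

# Hypothetical timestamps; the real analysis uses df['Start_Time']
start_times = pd.Series(pd.to_datetime([
    "2019-01-15", "2019-02-03",                 # Quarter 1
    "2019-05-20",                               # Quarter 2
    "2019-08-09", "2019-09-30",                 # Quarter 3
    "2019-10-01", "2019-11-11", "2019-12-24",   # Quarter 4
]))

# Share of records per quarter, as a percentage of all records
quarter_share = (
    start_times.dt.quarter
    .value_counts(normalize=True)
    .sort_index()
    * 100
)
print(quarter_share.round(2))
```

With the figures reported above, Quarter 4 holds roughly 1.7 times as many accidents as Quarter 1 (34.98 / 20.78).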


df['MonthAbbrev'] = df['Start_Time'].dt.strftime('%b')
df['Quarter'] = 'Quarter' + df.Start_Time.dt.quarter.astype('string')

pie_df = df.groupby(['Quarter', 'MonthAbbrev', 'Month'])['ID'].count().reset_index(name='TotalAccidents')

pie_df.sort_values(by=['Month'], inplace = True)

pie_df.reset_index(inplace = True, drop=True)

del pie_df['Month']

number_outside_colors = len(pie_df.Quarter.unique())
outside_color_ref_number = np.arange(number_outside_colors)*4

number_inside_colors = len(pie_df.MonthAbbrev.unique())
all_color_ref_number = np.arange(number_outside_colors + number_inside_colors)

inside_color_ref_number = []
for each in all_color_ref_number:
    if each not in outside_color_ref_number:
        inside_color_ref_number.append(each)
        
fig = plt.figure(figsize = (10,10))
ax = fig.add_subplot(1, 1, 1)

colormap = plt.get_cmap("tab20c")
outer_colors = colormap(outside_color_ref_number)

all_accidents = pie_df.TotalAccidents.sum()

pie_df.groupby(['Quarter'])['TotalAccidents'].sum().plot(
    kind = 'pie', radius=1, colors = outer_colors, pctdistance = 0.85, labeldistance = 1.1,
    wedgeprops = dict(edgecolor='w'), textprops={'fontsize':16},
    autopct = lambda p: '{:.2f}%\n({:.1f}K)'.format(p,(p/100)*all_accidents/1e3),
    startangle=90)

inner_colors = colormap(inside_color_ref_number)
pie_df.TotalAccidents.plot(
    kind = 'pie', radius=0.7, colors = inner_colors, pctdistance = 0.6, labeldistance = 0.8,
    wedgeprops = dict(edgecolor='w'), textprops={'fontsize':13},
    labels = pie_df.MonthAbbrev, 
    autopct = '%1.2f%%',
    startangle=90)

hole = plt.Circle((0,0), 0.3, fc = 'white')
ax.add_artist(hole)

ax.yaxis.set_visible(False)
plt.title('Total Accidents by Quarter and Month', fontsize = 18)

ax.text(0, 0, 'Total Accidents\n' + str(round(all_accidents/1e6,2)) + 'M', size=18, ha='center', va='center')

ax.axis('equal')

plt.tight_layout()

plt.show()

Descriptive Statistics

sev_min = df['Severity'].min()
sev_max = df['Severity'].max()
avg_sev = df['Severity'].mean()
print('Minimum Severity = ', sev_min)

## Minimum Severity =  1

print('Maximum Severity = ', sev_max)

## Maximum Severity =  4

print('Average Severity = ', avg_sev)

## Average Severity =  2.3050349659932414

print()

state_min = x['AccidentCount'].min()
state_max = x['AccidentCount'].max()
avg_state = x['AccidentCount'].mean()
avg_state_sev = x['AvgSeverity'].mean()
print('Minimum Accidents per State = ', state_min)

## Minimum Accidents per State =  220

print('Maximum Accidents per State = ', state_max)

## Maximum Accidents per State =  972585

print('Average Accidents per State = ', avg_state)

## Average Accidents per State =  86378.38775510204

print('Average per State Severity = ', avg_state_sev)

## Average per State Severity =  2.3692962775199073

print()

print('Data Types: ')

## Data Types:

df.info()

## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 4232541 entries, 0 to 4232540
## Data columns (total 19 columns):
##  #   Column             Dtype         
## ---  ------             -----         
##  0   ID                 object        
##  1   Severity           int64         
##  2   Start_Time         datetime64[ns]
##  3   Start_Lat          float64       
##  4   Start_Lng          float64       
##  5   State              object        
##  6   Temperature(F)     float64       
##  7   Visibility(mi)     float64       
##  8   Weather_Condition  object        
##  9   Year               int64         
##  10  Month              int64         
##  11  Day                int64         
##  12  Hour               int64         
##  13  DayOfTheWeek       int64         
##  14  WeekDay            object        
##  15  MonthName          object        
##  16  DayOfTheYear       int64         
##  17  MonthAbbrev        object        
##  18  Quarter            string        
## dtypes: datetime64[ns](1), float64(4), int64(7), object(6), string(1)
## memory usage: 613.5+ MB

print('NAs by Attribute: ')

## NAs by Attribute:

df.isna().sum()

## ID                       0
## Severity                 0
## Start_Time               0
## Start_Lat                0
## Start_Lng                0
## State                    0
## Temperature(F)       89900
## Visibility(mi)       98668
## Weather_Condition    98383
## Year                     0
## Month                    0
## Day                      0
## Hour                     0
## DayOfTheWeek             0
## WeekDay                  0
## MonthName                0
## DayOfTheYear             0
## MonthAbbrev              0
## Quarter                  0
## dtype: int64
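
The NA counts above are easier to judge as a share of all rows. The sketch below simply divides the printed counts by the row count reported by `df.info()`; the dictionary name is illustrative.

```python
# Counts copied from the output above
total_rows = 4_232_541
missing = {
    "Temperature(F)": 89_900,
    "Visibility(mi)": 98_668,
    "Weather_Condition": 98_383,
}

# Percentage of rows with a missing value, per column
missing_pct = {col: 100 * n / total_rows for col, n in missing.items()}
for col, pct in missing_pct.items():
    print(f"{col:<18} {pct:.2f}% missing")
```

Each of the three columns is missing in a little over 2% of rows. On the DataFrame itself, the same shares could be obtained with `df.isna().mean() * 100`.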

Conclusion

This analysis allows for a closer look into accidents in the Contiguous U.S. from 2016 to 2020 so that trends and potential determinants of accident risk can be identified. The bar plot for accident counts by state reveals that there is a significant long-tail effect. The top six states, especially California, have dramatically higher accident counts compared to the rest of the states which all fall into a relatively low range of accident counts. This is emphasized by the difference in the mean number of accidents per state for all states, 86,378.39, and for the top ten states, 279,890.5. However, the dual axis bar plot indicates that total accident count is not correlated with accident severity. The plot shows that average accident severity had an almost uniform distribution across the top ten states despite large differences in total accident counts.

The line plots for accidents by hour by day of week and accidents by month by year as well as the pie chart for accidents by quarter and month illustrate interesting trends in accidents over time. They show that over the course of each year accidents per month trend upwards which means that total accidents tend to increase each quarter and each year. The only year in the data set that deviated from this trend was 2020 which seems to have had a COVID effect causing accident counts to drop below 2016 levels in the summer before rising significantly higher than any previous levels in the fall and early winter. Furthermore, there appears to be a rush hour and weekend effect. Accident frequency is at its highest when people are commuting, and possibly rushing, to work in the morning on weekdays. This is closely followed by a slightly lower peak in accident frequency when people are commuting home on weekdays. This trend is not present on weekends and accident frequency is much lower overall on weekends, possibly due to less traffic from people driving to and from work.

These findings call for additional analysis but the information gathered so far identifies some specific issues to focus on for the effort to reduce the likelihood of accidents. New or altered traffic laws may be necessary in areas with higher accident counts, such as California. This could include stricter laws with regard to driving while distracted by things like phones. It is also possible that places with high accident counts require greater investment into infrastructure to accommodate and adapt to a greater volume of traffic.

New Research Questions

Just as data can be used to answer questions and explain trends, it can also create new questions for consideration. Here are some new questions that could be explored to continue the analysis of U.S. accidents.

Why do states like California have significantly more accidents than other states?
- Is this linked to factors such as population size, population density, or traffic laws?
Why did accidents not decrease until several months into the pandemic in 2020?
Why did accidents rapidly increase in the fall of 2020?
- Did people start driving more than before the pandemic after being in lockdown?
- Were they out of practice from not driving as much for several months?
Why is the number of accidents in the Contiguous U.S. consistently rising?
- Are people becoming more and more distracted while driving?

Issues with the Data Set

This data set contains a large and complete collection of data on accidents in the contiguous U.S. from 2016 to 2020. It also contains a large number of attributes for each accident, including coordinates for the location and the date and time. Having this information for such a complete data set allowed for a rather in-depth analysis of accidents in the U.S. over time and of how time of day, week, or year can impact the likelihood of an accident. However, data for many of the other attributes was collected in such a way that it would require a great deal of additional work to be usable in an analysis. For instance, weather conditions were not recorded in any standardized fashion, resulting in conditions like rain being described in various ways with different descriptive words. Many accidents were also missing data for attributes like weather conditions and temperature, making it even more challenging to conduct an analysis of the relationship between these variables and accident counts or severity.
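
One possible way to make the free-text weather descriptions usable is to collapse them into a few coarse categories by keyword. The sketch below is only an illustration of that idea: the category names, the keyword lists, the helper `normalize_weather`, and the sample strings are all hypothetical and would need to be checked against the values actually present in `Weather_Condition`.

```python
# Hypothetical keyword-to-category mapping; categories and keywords are
# illustrative and would need to be validated against the real values
CATEGORY_KEYWORDS = {
    "Rain": ("rain", "drizzle", "shower"),
    "Snow": ("snow", "sleet", "wintry"),
    "Fog": ("fog", "mist", "haze"),
    "Clear": ("clear", "fair"),
}

def normalize_weather(description):
    """Map a free-text weather description to a coarse category."""
    if description is None:
        return "Unknown"
    text = str(description).lower()
    # The first category with a matching keyword wins
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(word in text for word in keywords):
            return category
    return "Other"

samples = ["Light Rain", "Heavy Rain", "Rain Showers", "Light Snow", "Fair", None]
print([normalize_weather(s) for s in samples])
```

Because the first matching category wins, mixed descriptions are assigned by dictionary order, which is a simplification. In pandas, missing values arrive as NaN rather than None, so applying this to a DataFrame column would also need a `pd.isna` check.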

Sources Used in the Creation of the Data Set

Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, and Rajiv Ramnath. “A Countrywide Traffic Accident Dataset.” 2019.
Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, Radu Teodorescu, and Rajiv Ramnath. “Accident Risk Prediction based on Heterogeneous Sparse Data: New Dataset and Insights.” In Proceedings of the 27th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems. ACM, 2019.