Motor Vehicle Collisions in NYC Analysis

Introduction

The purpose of this report is to analyze motor vehicle collision trends in the five boroughs of New York City. Unfortunately, motor vehicle collisions are a daily occurrence that can result in severe injury or death. It is important to analyze these trends because they can inform staffing requirements for first responders, enforcement of traffic laws, and the implementation of traffic safety measures to limit vehicle collisions.

Dataset

The data used for this report comes from NYC Open Data, a site with free public data published by New York City agencies and other partners. The specific dataset contains information on all police-reported motor vehicle collisions in NYC in which a person was injured or killed, or in which there was over $1,000 worth of damage. The dataset is updated daily, and the version used for this report contains data from 2017 to April 1, 2022. Most notably for the purposes of this report, the dataset provides the crash date, borough, contributing factor, number of persons injured, and number of persons killed for each reported motor vehicle collision.

Findings

The five visualizations in the tabs below are useful for drawing conclusions about when and in which borough motor vehicle collisions occur most. Additionally, the visualizations help provide an understanding of the top reasons that contribute to motor vehicle collisions, as well as the mortality trends of motor vehicle collisions in NYC.

Referring to the visualizations, Driver Inattention/Distraction is the leading cause of motor vehicle collisions in NYC. Surprisingly, driver intoxication was well below the average.

The visualizations further show that motor vehicle collisions decreased in 2020 and 2021 compared to 2017-2019 in all five boroughs. However, Brooklyn and Queens are still the two boroughs with the most motor vehicle collisions. Following this trend, Brooklyn and Queens also have the greatest number of both injuries and deaths from motor vehicle collisions. Notably, Brooklyn and Queens are also the two boroughs with the greatest population.

Additionally, in Brooklyn, motor vehicle collisions are most frequent at the start of the work week, on Monday. The 2020 data also show more collisions resulting in injury and death in the summer months, with a steep decrease in April and May.

Top Reasons

This tab shows a horizontal bar chart that displays the top 20 reasons for motor collisions in NYC. A mean line is displayed for the user to easily see which reasons were above or below the average as contributing factors for motor collisions.

When analyzing this visualization, we can see that Driver Inattention/Distraction, Failure to Yield Right-of-Way, and Backing Unsafely were the top 3 reasons for motor vehicle collisions. Because Driver Inattention/Distraction is the contributing factor that is most significantly above the average, it would be useful to see a further breakout of this category. For example, were the drivers distracted because they were texting, eating, or reading the newspaper? Backing Unsafely may be another contributing factor that changes with the progression of vehicle safety measures, such as backup cameras.

import os
os.environ['QT_QPA_PLATFORM_PLUGIN_PATH'] = 'c:/ProgramData/Anaconda3/Library/plugins/platforms'
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from matplotlib.ticker import FuncFormatter

path = "//apporto.com/dfs/LOYOLA/Users/srsavickis_loyola/Documents/"
filename = path + 'Motor_Vehicle_Collisions3.csv'

df= pd.read_csv(filename,usecols=['CRASH DATE','BOROUGH','CONTRIBUTING FACTOR VEHICLE 1','NUMBER OF PERSONS INJURED','NUMBER OF PERSONS KILLED'])

df['CRASH DATE']=pd.to_datetime(df['CRASH DATE'], format ='%m/%d/%Y')
df['Year']=df['CRASH DATE'].dt.year
df['Month']=df['CRASH DATE'].dt.month
df['Weekday']=df['CRASH DATE'].dt.strftime('%a')
df['MonthName']=df['CRASH DATE'].dt.strftime('%b')

# Keep only rows with a borough, a contributing factor, and injury/death counts.
# Year and Month come from parsed dates, so they need no separate filter.
df=df.dropna(subset=['BOROUGH','CONTRIBUTING FACTOR VEHICLE 1','NUMBER OF PERSONS INJURED','NUMBER OF PERSONS KILLED'])

x=df.groupby(['CONTRIBUTING FACTOR VEHICLE 1']).agg({'CONTRIBUTING FACTOR VEHICLE 1':['count']}).reset_index()

x.columns=['Crash_Reason','Count']

# Drop junk factor codes: any label containing '1' or '80' (regex match)
badrows=x[x['Crash_Reason'].str.contains('1|80')]
x=x[~x['Crash_Reason'].isin(badrows.Crash_Reason)]

x=x.sort_values('Count', ascending=False)
x.reset_index(inplace=True,drop=True)


def pick_colors_according_to_mean_count(this_data):
    colors=[]
    avg=this_data.Count.mean()
    for each in this_data.Count:
        if each>avg*1.01:
            colors.append('red')
        elif each < avg*.99:
            colors.append('green')
        else:
            colors.append('orange')
    return colors
  
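# .loc slices by label and includes both ends, so rows 1..20 give 20 reasons;
# row 0 (the single most frequent label after sorting) is left out of the chart.
bottom2=1
top2=20
d2=x.loc[bottom2:top2]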
d2=d2.sort_values('Count', ascending=True)
d2.reset_index(inplace=True,drop=True)
mycolors2=pick_colors_according_to_mean_count(d2)

above=mpatches.Patch(color='red',label='Above Average')
at=mpatches.Patch(color='orange',label='Within 1% of Average')
below=mpatches.Patch(color='green',label='Below Average')

fig=plt.figure(figsize=(24, 18))
ax1=fig.add_subplot(1,1,1)
ax1.barh(d2.Crash_Reason,d2.Count,color=mycolors2)

# Label each bar with its count, reusing the bar colors computed above
for row_counter, (value_at_row_counter, color) in enumerate(zip(d2.Count, mycolors2)):
    ax1.text(value_at_row_counter+2800, row_counter, '{:,}'.format(value_at_row_counter),color=color, size=12, fontweight='bold',
            ha='left',va='center',backgroundcolor='white')
plt.xlim(0,d2.Count.max()*1.1)

ax1.legend(loc='lower right', handles=[above,at,below], fontsize=14)
plt.axvline(d2.Count.mean(),color='black', linestyle='dashed')
ax1.text(d2.Count.mean()+1500,0,'Mean =' + str("{:,.2f}".format(d2.Count.mean())),rotation=0, fontsize=14)

ax1.set_title('Top ' + str(top2)+ ' Crash Reasons', size=20)
ax1.set_xlabel('Crash Reason Count', fontsize=16)
ax1.set_ylabel('Crash Reason',fontsize=16)

plt.show()
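
The red/orange/green rule used for the bars above (more than 1% above the mean, more than 1% below it, or in between) can also be written without an explicit loop. The following is a minimal sketch using `np.select`; the function name `classify_against_mean`, the `tolerance` parameter, and the example counts are assumptions made for illustration and are not part of the original analysis.

```python
import numpy as np
import pandas as pd

def classify_against_mean(counts, tolerance=0.01):
    """Return one color per value: red above, green below, orange near the mean."""
    counts = pd.Series(counts, dtype="float64")
    avg = counts.mean()
    conditions = [
        counts > avg * (1 + tolerance),  # clearly above average
        counts < avg * (1 - tolerance),  # clearly below average
    ]
    # Anything matching neither condition is within the tolerance band
    return np.select(conditions, ["red", "green"], default="orange").tolist()

# Illustrative counts only (not taken from the collision dataset); their mean is 100
example_counts = [150, 100, 50]
example_colors = classify_against_mean(example_counts)
print(example_colors)
```

With the collision data, the same call could be applied to `d2.Count`, and the resulting list could be passed as the `color` argument of `barh`.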

Collisions by Borough

This tab shows a scatterplot to display the total number of motor vehicle collisions by borough from 2017 through 2021. From 2017-2019, Brooklyn, Manhattan, and Queens were the boroughs with the greatest number of motor vehicle collisions. However, motor vehicle collisions decreased across all five boroughs in 2020 and 2021.


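x=df.groupby(['Year','BOROUGH'])['Year'].count().reset_index(name='Count')

# Exclude 2022, which is only partially covered by this version of the dataset
x=x[x['Year']<2022].copy()

# Scale counts to hundreds so they can double as marker sizes
x['counthundred']=(x['Count']/100).round(0)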

plt.figure(figsize=(16,10))
plt.scatter(x['BOROUGH'], x['Year'],marker='8',cmap='viridis',c=x['counthundred'],
            s=x['counthundred'],edgecolors='black')

plt.title('Motor Collisions by NYC Borough', fontsize=18)
plt.xlabel('Borough',fontsize=14)
plt.ylabel('Year',fontsize=14)

cbar=plt.colorbar()
cbar.set_label('Number of Motor Collisions',rotation=270,fontsize=14,color='black',labelpad=30)

# Place a tick every 100 units of 'counthundred' and label each with the matching raw count,
# so the tick and label lists always have the same length
mycolorbarticks=[*range(100,int(x['counthundred'].max()),100)]
cbar.set_ticks(mycolorbarticks)

mycbtickslab=['{:,}'.format(each*100) for each in mycolorbarticks]
cbar.set_ticklabels(mycbtickslab)

my_ytick=[*range(x['Year'].min(),x['Year'].max()+1,1)]
plt.yticks(my_ytick, fontsize=14, color='black')

plt.xticks(fontsize=14,color='black')

plt.show()
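
To put a number on the year-over-year decline that the scatterplot shows, the yearly counts can be pivoted into one column per borough and compared with the previous year. This is a minimal sketch with made-up counts, not figures from the dataset; the names `counts`, `by_year`, and `pct_change` are introduced here for illustration only.

```python
import pandas as pd

# Made-up yearly counts for two boroughs, for illustration only
counts = pd.DataFrame({
    "Year": [2019, 2019, 2020, 2020],
    "BOROUGH": ["BROOKLYN", "QUEENS", "BROOKLYN", "QUEENS"],
    "Count": [1000, 800, 500, 600],
})

# One row per year, one column per borough
by_year = counts.pivot(index="Year", columns="BOROUGH", values="Count")

# Percent change from the previous year within each borough;
# the first year has no predecessor, so it comes out as NaN
pct_change = by_year.diff().div(by_year.shift()).mul(100).round(1)
print(pct_change)
```

Applied to the `x` DataFrame built above, the same pivot would use `values='Count'` with the real per-borough totals.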

Brooklyn Accidents

This tab shows a multiple line plot to display the total number of motor vehicle collisions in the borough with the greatest number of motor vehicle collisions, Brooklyn, by month and day of the week.

When analyzing this visualization, we can see that the greatest number of motor vehicle collisions occur at the start of the week, on Monday, and the fewest occur on Wednesday and Thursday. Additionally, a significant trend seen across all days of the week is a spike in motor vehicle collisions in March, followed by a decline in April.


df2=df.groupby(['MonthName','Month','BOROUGH','Weekday'])['MonthName'].count().reset_index(name='CountofAccidents')
df2=df2[df2['BOROUGH']== 'BROOKLYN']

del df2['BOROUGH']
df2.sort_values(by=['Month'], inplace=True)
fig=plt.figure(figsize=(18,10))
ax=fig.add_subplot(1,1,1)

my_colors = {'Mon':'blue',
            'Tue':'red',
            'Wed':'green',
            'Thu':'gray',
            'Fri':'purple',
            'Sat':'gold',
            'Sun':'brown'}

ax.set_xticks(np.arange(12))

# Group by the column name (not a one-item list) so each key is a plain string such as 'Mon'
for key, grp in df2.groupby('Weekday'):
    grp.plot(ax=ax, kind='line', x='MonthName', y='CountofAccidents', color=my_colors[key], label=key, marker='8')

    
plt.title('Total Accidents in Brooklyn By Month and Day of Week',fontsize=18)
ax.set_xlabel('Month', fontsize=18)
ax.set_ylabel('Total Accidents', fontsize=18, labelpad=20)
ax.tick_params(axis='x', labelsize=14, rotation=0)
ax.tick_params(axis='y', labelsize=14, rotation=0)


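# groupby yields weekdays alphabetically (Fri, Mon, Sat, Sun, Thu, Tue, Wed);
# reorder handles and labels together so the legend reads Mon through Sun
# and each label stays attached to its own line
handles,labels =ax.get_legend_handles_labels()
weekday_positions=[1,5,6,4,0,2,3]
handles=[handles[i] for i in weekday_positions]
labels=[labels[i] for i in weekday_positions]
plt.legend(handles,labels,loc='best', fontsize=14)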

ax.yaxis.set_major_formatter( FuncFormatter(lambda x,pos: format(int(x), ',')))

plt.show()
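
Because the weekday abbreviations are plain strings, pandas sorts them alphabetically when grouping. One alternative to reordering legend entries by position is to store the weekday as an ordered categorical, so grouping follows calendar order directly. The sketch below uses made-up rows and assumes English abbreviations like those produced by `strftime('%a')` in an English locale; `weekday_order`, `sample`, and `totals` are illustrative names.

```python
import pandas as pd

weekday_order = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

# Made-up rows for illustration; 'Weekday' mirrors the column built earlier
sample = pd.DataFrame({
    "Weekday": ["Wed", "Mon", "Sun", "Mon", "Fri"],
    "CountofAccidents": [3, 5, 2, 4, 6],
})

# An ordered categorical makes sorting and grouping follow calendar order
sample["Weekday"] = pd.Categorical(
    sample["Weekday"], categories=weekday_order, ordered=True
)

# observed=True keeps only the weekdays that actually appear in the data
totals = sample.groupby("Weekday", observed=True)["CountofAccidents"].sum()
print(list(totals.index))
```

With this approach the groups, and therefore the plotted lines and legend entries, come out Monday first without a hand-written index list.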

Killed vs. Injured

This tab shows a dual-axis bar chart to compare the number of persons killed versus injured in motor vehicle collisions in each of the five boroughs. The left y-axis, as well as the bars colored in green, shows the number of injuries from vehicle collisions in each borough. The right y-axis, as well as the bars colored in red, shows the number of deaths from vehicle collisions in each borough.

When analyzing the visualization, it is apparent that the boroughs with the most victims injured also had the most victims killed. Staten Island had the fewest injuries and deaths from motor vehicle collisions, while Brooklyn had the most.


# reset_index already returns a DataFrame, so no extra conversion is needed
df_inj=df.groupby(['BOROUGH'])['NUMBER OF PERSONS INJURED'].sum().reset_index(name='CountInj')

df_killed=df.groupby(['BOROUGH'])['NUMBER OF PERSONS KILLED'].sum().reset_index(name='CountKilled')

def autolabel(these_bars, this_axis, comma):
    for each_bar in these_bars:
        height = each_bar.get_height()
        this_axis.text(each_bar.get_x()+each_bar.get_width()/2, height*1.01,format(height,comma),
                        fontsize=11, color='black',ha='center',va='bottom')
fig = plt.figure(figsize=(16,10))
ax1=fig.add_subplot(1,1,1)
ax2=ax1.twinx()
bar_width=0.4

x_pos=np.arange(5)
inj_bars=ax1.bar(x_pos-(.5*bar_width), df_inj.CountInj, bar_width, color='green', edgecolor='black',label='Count of Persons Injured')
ax1.set_xlabel('Borough',fontsize=18)
ax1.set_ylabel('Count of Persons Injured', fontsize=18, labelpad=20)
ax2.set_ylabel('Count of Persons Killed', fontsize=18, rotation=270, labelpad=20)

ax1.tick_params(axis='y', labelsize=14)
ax2.tick_params(axis='y', labelsize=14)

killed_bars=ax2.bar(x_pos+(.5*bar_width), df_killed.CountKilled, bar_width, color='red', edgecolor='black',label='Count of Persons Killed')
plt.title('Count of Persons Killed vs. Injured by NYC Borough', fontsize=18)
ax1.set_xticks(x_pos)
ax1.set_xticklabels(df_inj.BOROUGH, fontsize=14)


countinj_color, countinj_label = ax1.get_legend_handles_labels()
countkilled_color, countkilled_label = ax2.get_legend_handles_labels()
legend = ax1.legend(countinj_color + countkilled_color, countinj_label + countkilled_label, loc = 'upper left', frameon=True, ncol=1, shadow=True,
                   borderpad=1, fontsize=14)
ax1.set_ylim(0, df_inj.CountInj.max()*1.5)

ax1.yaxis.set_major_formatter( FuncFormatter(lambda x,pos: format(int(x), ',')))

autolabel(inj_bars,ax1,',.0f')
autolabel(killed_bars,ax2,',.0f')

plt.show()
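
Since the injury and death totals sit on very different scales, a ratio can make boroughs easier to compare than two separate axes. The following is a minimal sketch with made-up rows; the derived column `KilledPer1000Inj` is a hypothetical metric introduced here and does not appear in the original analysis.

```python
import pandas as pd

# Made-up collision rows for illustration only
sample = pd.DataFrame({
    "BOROUGH": ["BROOKLYN", "BROOKLYN", "QUEENS", "QUEENS"],
    "NUMBER OF PERSONS INJURED": [300, 200, 150, 250],
    "NUMBER OF PERSONS KILLED": [1, 1, 0, 1],
})

# Named aggregation builds both totals in a single groupby pass
summary = sample.groupby("BOROUGH").agg(
    CountInj=("NUMBER OF PERSONS INJURED", "sum"),
    CountKilled=("NUMBER OF PERSONS KILLED", "sum"),
)

# Hypothetical derived metric: deaths per 1,000 injuries
summary["KilledPer1000Inj"] = summary["CountKilled"] / summary["CountInj"] * 1000
print(summary)
```

A single table like `summary` could also replace the separate `df_inj` and `df_killed` frames when drawing the two sets of bars.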

Killed and Injured (2020)

This tab shows a nested pie chart to compare the number of injuries and deaths from motor vehicle collisions in 2020 by quarter and month. While quarter 1 and quarter 4 had almost equal shares of injuries and deaths, there is a greater gap between quarter 2, which had the lowest share, and quarter 3, which had the highest. Most months in 2020 accounted for roughly the same share of injuries and deaths; however, April and May show significantly lower shares, at around 3% for April and 5.5% for May. Notably, in the multiple line plot for Brooklyn’s motor vehicle collisions, April was also the month with a sharp decrease in collisions.

df5=df.copy()

del df5['BOROUGH']
del df5['CONTRIBUTING FACTOR VEHICLE 1']

# Combine injuries and deaths into one per-collision total
df5['SumInjKilled']=df5['NUMBER OF PERSONS INJURED']+df5['NUMBER OF PERSONS KILLED']

df5=df5.groupby(['CRASH DATE','Year','MonthName', 'Month'])['SumInjKilled'].sum().reset_index(name='SumInjKilled')

df5=df5[df5['Year']== 2020]

df5['Quarter']='Quarter '+df5['CRASH DATE'].dt.quarter.astype('string')
pie_df=df5.groupby(['Quarter','MonthName', 'Month'])['SumInjKilled'].sum().reset_index(name='SumInjKilled')

pie_df.sort_values(by=['Month'], inplace=True)

pie_df.reset_index(inplace=True, drop=True)

del pie_df['Month']
pie_df

number_outside_colors=len(pie_df.Quarter.unique())
outside_color_ref_number=np.arange(number_outside_colors)*4

number_inside_colors=len(pie_df.MonthName.unique())
all_color_ref_number=np.arange(number_outside_colors+ number_inside_colors)

insidecolorref =[]

for each in all_color_ref_number:
    if each not in outside_color_ref_number:
        insidecolorref.append(each)


fig=plt.figure(figsize=(10,10))
ax=fig.add_subplot(1,1,1)

colormap=plt.get_cmap("tab20c")
outercolors=colormap(outside_color_ref_number)

allinjkilled=pie_df.SumInjKilled.sum()

pie_df.groupby(['Quarter'])['SumInjKilled'].sum().plot(
    kind='pie', radius=1, colors=outercolors, pctdistance=0.85, labeldistance=1.1,
    wedgeprops = dict(edgecolor='white'),textprops=dict(fontsize=18),
    autopct=lambda p:'{:.2f}%\n({:,.0f})'.format(p,(p/100)*allinjkilled),
    startangle=90)

innercolors=colormap(insidecolorref)
pie_df.SumInjKilled.plot(
    kind='pie', radius=0.7, colors=innercolors, pctdistance=0.55, labeldistance=0.8,
    wedgeprops = dict(edgecolor='white'),textprops=dict(fontsize=13),
    labels= pie_df.MonthName,
    autopct= '%1.2f%%',
    startangle=90)

hole = plt.Circle((0,0),0.3, fc='white')
fig1=plt.gcf()
fig1.gca().add_artist(hole)

ax.yaxis.set_visible(False)
plt.title('Total Persons Killed and Injured in Motor Collisions\n By Quarter and by Month (2020)',fontsize=18)

ax.text(0,0,'Total Persons\n Injured and Killed\n' + str("{:,.0f}".format(allinjkilled)),size=16, ha='center',va='center')

ax.axis('equal')

plt.tight_layout()


plt.show()
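
The percentages printed on the pie wedges can also be computed directly, which makes it easier to quote exact shares in the text. This is a minimal sketch with made-up daily totals; the names `daily`, `quarter_share`, and `month_share` are illustrative, and months are grouped by number rather than by name to keep the example independent of locale.

```python
import pandas as pd

# Made-up daily totals for illustration only
daily = pd.DataFrame({
    "CRASH DATE": pd.to_datetime(
        ["2020-01-15", "2020-04-10", "2020-07-04", "2020-08-20"]
    ),
    "SumInjKilled": [40, 10, 30, 20],
})

daily["Quarter"] = "Quarter " + daily["CRASH DATE"].dt.quarter.astype(str)
daily["Month"] = daily["CRASH DATE"].dt.month

total = daily["SumInjKilled"].sum()

# Share of the yearly total, in percent, for each quarter and each month
quarter_share = daily.groupby("Quarter")["SumInjKilled"].sum().div(total).mul(100)
month_share = daily.groupby("Month")["SumInjKilled"].sum().div(total).mul(100)
print(quarter_share.round(2))
print(month_share.round(2))
```

Both series sum to 100, so they correspond to the outer and inner rings of the nested pie respectively.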

Conclusion

After analyzing the visualizations, there are both positive and negative takeaways. First, the decrease in motor vehicle collisions in 2020 and 2021 is consistent with actions to increase vehicle and traffic safety having had some success, although this dataset alone cannot establish the cause of the decline. However, the large number of motor vehicle collisions attributed to driver inattention/distraction must be further investigated. Such a broad, nonspecific category is unhelpful in pinpointing the underlying causes of motor vehicle collisions in NYC. Furthermore, it could be worthwhile to investigate the reasons behind such a significant decrease in collisions in April. I would have expected there to be fewer collisions in the summer months, when people are more inclined to walk, rather than in the often rainy month of April. Finally, trends in mortality from motor vehicle collisions show that the boroughs with the greatest population and, therefore, the most collisions may require greater staffing of first responders and EMTs.