Data Visualization: Python

Dataset of Collisions in Takoma Park, MD from 2015 - 2023

The data originally has 965 observations of collisions with 40 features. I chose to specifically look at 6 features: ‘Crash Date/Time’, ‘Hit/Run’, ‘Surface Condition’, ‘First Harmful Event’, ‘Latitude’, and ‘Longitude’.

After removing missing values, I am left with 920 observations.
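
The effect of removing missing values can be checked by comparing row counts before and after the filter. The following is a minimal sketch on a small invented frame (the rows and the name `toy` are hypothetical stand-ins for the real CSV), assuming that only the two categorical columns contain gaps:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the collision data; the values are invented
toy = pd.DataFrame({
    'Surface Condition': ['DRY', np.nan, 'WET', 'DRY'],
    'First Harmful Event': ['OTHER VEHICLE', 'PEDESTRIAN', np.nan, 'BICYCLE'],
})

# Keep only rows where both columns are present
cleaned = toy.dropna(subset=['Surface Condition', 'First Harmful Event'])

rows_removed = len(toy) - len(cleaned)
print(f"{len(toy)} rows before, {len(cleaned)} after, {rows_removed} removed")
```

Run against the full dataset, the same comparison should reproduce the drop from 965 to 920 observations.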

Exploration of Data

Bar Chart of Collisions by Year

As shown below, the number of collisions rose steadily until 2018, when there was a slight decline, and dropped again in 2020 when the Covid pandemic started. After 2020, collisions rose steadily again as employers introduced return-to-office policies. Because I did not live in Takoma Park before 2021, I can only wonder whether the drop in 2018 was due to something like newly installed street lights. Without further research, this will remain a mystery.

import os
os.environ['QT_QPA_PLATFORM_PLUGIN_PATH'] = 'C:/Users/ashkl/anaconda3/Library/plugins/platforms'

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
import folium
from matplotlib.ticker import FuncFormatter
import seaborn as sns

warnings.filterwarnings("ignore")

path = "C:/Users/ashkl/OneDrive/Documents/Loyola/data visualization/Takoma_Park_Crash_Incidents_Data.csv"
df = pd.read_csv(path, usecols= ['Crash Date/Time', 'Hit/Run', 'Surface Condition', 'First Harmful Event', 'Latitude', 'Longitude'])

#remove rows that are missing a Surface Condition or First Harmful Event
df = df[df['Surface Condition'].notna() & df['First Harmful Event'].notna()]

#convert date/time
df['Crash Date/Time'] = pd.to_datetime(df['Crash Date/Time'], format = '%m/%d/%Y %H:%M') 
df.replace(('Yes', 'No'), (1, 0), inplace=True)

#create the year variable in the main df
df['Year'] = df['Crash Date/Time'].dt.year

#make the df for the collisions by year bar chart
#reset_index already returns a DataFrame, so no extra conversion is needed
crash_by_year = df.groupby(['Year', 'First Harmful Event'])['Year'].count().reset_index(name = 'count')
crash_year_count = df.groupby(['Year'])['Year'].count().reset_index(name = 'count')

#plot the bar chart
x_axis = crash_year_count['Year']
y_axis = crash_year_count['count']

fig, ax = plt.subplots()
bar_container = ax.bar(x_axis, y_axis, color ='lightgreen')

#take the tick labels from the data so they always match the years that are present
labels = x_axis.astype(str)

plt.xticks(x_axis, labels)

plt.title("Count of Collisions per Year", fontsize = 14)
plt.xlabel("Year", fontsize = 12)
plt.ylabel("Count of Collisions", fontsize = 12)
plt.ylim(0,150)

#print the count above each bar
ax.bar_label(bar_container, fmt='{:,.0f}')
plt.show()
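
To put numbers on the rises and dips described above, the yearly counts can be turned into a percent change against the previous year. This is a minimal sketch with invented counts; `toy_counts` and its values are hypothetical and only mirror the shape of `crash_year_count`:

```python
import pandas as pd

# Hypothetical yearly counts; swap in crash_year_count for the real figures
toy_counts = pd.DataFrame({'Year': [2017, 2018, 2019, 2020],
                           'count': [120, 100, 110, 55]})

# Percent change relative to the previous year (the first year has no predecessor, so it is NaN)
toy_counts['pct_vs_prev'] = toy_counts['count'].pct_change() * 100

print(toy_counts.round(1))
```

A table like this makes it easier to judge whether a dip is "slight" or substantial than reading bar heights alone.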

Scatter Plot of Collisions based on Surface Conditions by Year

I was wondering if the surface condition had anything to do with the number of collisions. From this graph, we can determine that most collisions actually happen in dry conditions, followed (with a large gap) by wet driving conditions.

#create the DataFrame for the scatter plot
surface_condition_by_year = df.groupby(['Year', 'Surface Condition'])['Year'].count().reset_index(name = 'count')

#plot the scatter plot
plt.scatter(surface_condition_by_year['Surface Condition'], surface_condition_by_year['Year'], marker = 'X', cmap = 'inferno', c = surface_condition_by_year['count'], s = surface_condition_by_year['count'], edgecolors= 'black')
plt.title("Number of Collisions by Year and Road Condition", fontsize = 14)
plt.xlabel("Road Condition", fontsize = 12)
plt.ylabel("Year", fontsize = 12)
#cbar = plt.colorbar()
#cbar.set_label("Number of Collisions", rotation = 270, fontsize = 12, color = 'black', labelpad = 30)
plt.show()
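
The gap between dry and wet conditions can also be expressed as a share of all collisions. The sketch below uses an invented series (`toy_surface` is hypothetical; the real values live in `df['Surface Condition']`):

```python
import pandas as pd

# Hypothetical surface-condition values; the real column is df['Surface Condition']
toy_surface = pd.Series(['DRY', 'DRY', 'DRY', 'WET', 'DRY', 'WET', 'SNOW', 'DRY'])

# normalize=True returns proportions instead of raw counts, sorted from largest to smallest
share = toy_surface.value_counts(normalize=True) * 100

print(share.round(1))
```

These shares describe where collisions were recorded, not how risky each condition is, since the data does not say how often the roads were dry or wet.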

Map of Collisions as colored by Collision Type

I had wondered whether there was any relationship between the number of collisions in my city and where they occurred. As shown below, apart from there being more collisions on major roads (New Hampshire Ave, University Blvd, East-West Highway, etc.), there was no large cluster of crashes except perhaps in one area. If you look closely near the upper right, where University Blvd and Carroll Ave meet, there is a rather large accumulation of car crashes. This area is heavily used because of the school nearby, but there may be another reason. Near this intersection is Merrimac Dr: this street has no streetlights, and people routinely use it to cross University Blvd, which is a major 6-lane road.

The colors here represent the object/being that the car initially hit: Green = Bicycle, Blue = Fixed Object, Orange = Other Vehicle, Yellow = Parked Vehicle, and Red = Pedestrian.

#create the necessary DataFrames:
keep = ['OTHER VEHICLE', 'PARKED VEHICLE', 'FIXED OBJECT', 'PEDESTRIAN', 'BICYCLE']
map_df = df[df['First Harmful Event'].isin(keep)] 
crash_map_df = map_df.groupby(['First Harmful Event']).size().reset_index(name = 'Count')

#plot it! The center is actually the Takoma Park Police Station/city hall area.

center_of_map = [38.98164529888358, -77.01052686517662] 
my_map = folium.Map(location = center_of_map,
                   zoom_start = 13,
                   width = '90%',
                   height = '100%',
                   left = '5%',
                   right = '5%',
                   top = '0%') 
tiles = ['CartoDB Positron', 'openstreetmap']
for tile in tiles:
    folium.TileLayer(tile).add_to(my_map)

folium.LayerControl().add_to(my_map)

#look up each marker's color from its collision type
event_colors = {'BICYCLE': 'green',
                'FIXED OBJECT': 'blue',
                'OTHER VEHICLE': 'orange',
                'PARKED VEHICLE': 'yellow',
                'PEDESTRIAN': 'red'}

for index, row in map_df.iterrows():
    color = event_colors[row['First Harmful Event']]
    folium.Circle(location=[row['Latitude'], row['Longitude']],
                  radius=50,
                  color=color,
                  fill=True,
                  fill_color=color,
                  fill_opacity=0.5).add_to(my_map)


my_map.save("C:/Users/ashkl/OneDrive/Documents/Loyola/data visualization/takoma_park_collision_map.html")
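
One rough way to check a cluster that was spotted by eye is to bin the coordinates into a coarse grid and count the points in each cell. The sketch below is only an illustration: the coordinates in `toy_points` are invented, and rounding to 3 decimal places is an assumed cell size, not something taken from the dataset.

```python
from collections import Counter

# Hypothetical (latitude, longitude) pairs; the real values come from map_df
toy_points = [
    (38.9871, -76.9882), (38.9874, -76.9879), (38.9868, -76.9884),
    (38.9816, -77.0105), (38.9752, -77.0021),
]

# Rounding to 3 decimal places groups points into cells roughly 100 m across
cells = Counter((round(lat, 3), round(lon, 3)) for lat, lon in toy_points)

# The most common cells are candidate hotspots
for cell, n in cells.most_common(2):
    print(cell, n)
```

Cells with unusually high counts can then be compared against the map to see whether they line up with the intersection described above.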

Heatmap of Collisions by Year and by Month

My next question was whether more collisions happen in certain months. Possibly, when school is out, there would be fewer collisions. As shown below, my hypothesis was not supported. In some years there seem to be more collisions in Sept-Dec, but this was not always the case. More often, the collisions appear to happen at random and may not be related to the month, per se. It is quite possible that some other force is at work here that we’re not seeing in this data.

#add the month, hour, and weekday to the main df:
df['Month'] = df['Crash Date/Time'].dt.month
df['Hour'] = df['Crash Date/Time'].dt.hour
df['Weekday'] = df['Crash Date/Time'].dt.strftime('%a')

#create the necessary DataFrames for the specific visualization:
crash_by_month_by_year = df.groupby(['Year', 'Month'])['Year'].count().reset_index(name = 'count')
heatmap_df = pd.pivot_table(crash_by_month_by_year, index = 'Year', 
                            columns= 'Month', values = 'count')
                            
#plot it! the best part, we can agree.
fig = plt.figure(figsize=(15, 10))
ax = fig.add_subplot(1, 1, 1)


ax = sns.heatmap(heatmap_df, linewidth = 0.2, annot=True, cmap = 'cool', fmt=',.0f',
                square=True, annot_kws={'size': 11})
ax.invert_yaxis()

plt.title('Heatmap of Takoma Park, MD Car Crashes by Year and Month', fontsize=18, pad=15)
plt.xlabel('Crash Month', fontsize=14, labelpad=10)
plt.ylabel('Crash Year', fontsize=14, labelpad=10)
plt.yticks(rotation = 0, size=11)

plt.xticks(size=11)

cbar = ax.collections[0].colorbar
cbar.set_label('Number of Crashes', rotation=270, fontsize=14, color='black',
              labelpad=25)

plt.show()
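
A pivot table leaves a gap wherever a year/month combination has no rows, which can make months hard to compare. The sketch below uses invented counts (`toy` is hypothetical) and assumes that a missing cell means zero crashes, which would not hold for months outside the period the data covers:

```python
import pandas as pd

# Hypothetical monthly counts; months with no crashes are simply absent
toy = pd.DataFrame({'Year': [2021, 2021, 2022, 2022],
                    'Month': [1, 3, 1, 2],
                    'count': [4, 6, 2, 5]})

pivot = pd.pivot_table(toy, index='Year', columns='Month', values='count')

# Add any missing month columns, then treat empty year/month cells as zero crashes (an assumption)
pivot = pivot.reindex(columns=range(1, 13)).fillna(0)

# Average crashes per calendar month across years
monthly_mean = pivot.mean(axis=0)
print(monthly_mean.head(3))
```

Comparing the per-month averages is a quick numeric companion to the heatmap when looking for a seasonal pattern.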

Collisions by Hour and Colored by Weekday

The last thing I wanted to investigate for this project was whether there was a pattern in the number of collisions by hour and weekday. As you can see below, the number of crashes by weekday varies widely, but there is a clear pattern of crashes by the hour. According to this data, the safest time to drive is between 9PM and 5:30AM. The rush hours show up between 5:30AM and 10:30AM and again from 1PM until 6PM. It is interesting to see that even though there is a dropoff in activity around 11AM, the number of collisions steadily rises until the evening rush hour begins.

#create the necessary DataFrames:
crash_by_day = df.groupby(['Hour', 'Weekday'])['Weekday'].count().reset_index(name = 'count')

#Plot it!
fig = plt.figure(figsize=(15, 10))
ax = fig.add_subplot(1, 1, 1)

mycolors = {'Mon':'red', 
            'Tue':'orange', 
            'Wed':'black',
            'Thu':'green', 
            'Fri':'blue', 
            'Sat':'purple',
            'Sun':'grey'}

for key, grp in crash_by_day.groupby('Weekday'): 
    grp.plot(ax = ax, kind = 'line', x='Hour', y = 'count', color = mycolors[key], label = key, marker = '8')

plt.title('Total Collisions by Hour', fontsize = 18)
ax.set_xlabel('Hour', fontsize = 18)
ax.set_ylabel('Total Collisions', fontsize = 18, labelpad = 20)
ax.tick_params(axis = "x", labelsize = 14, rotation = 0)
ax.tick_params(axis = "y", labelsize = 14, rotation = 0)
ax.set_xticks(np.arange(24))

handles, labels = ax.get_legend_handles_labels() 
#groupby returns the weekdays alphabetically (Fri, Mon, Sat, Sun, Thu, Tue, Wed), so reorder the handles and labels to run Mon-Sun
handles = [handles[1], handles[5], handles[6], handles[4], handles[0], handles[2], handles[3]]
labels = [labels[1], labels[5], labels[6], labels[4], labels[0], labels[2], labels[3]]
plt.legend(handles, labels, loc = 'best', fontsize=14, ncol=1)
plt.show()
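
The manual reshuffling of legend entries can be avoided by giving the weekday column an explicit order. This is a minimal sketch on invented rows (`toy` and `weekday_order` are hypothetical names), offered as an alternative rather than the approach used above:

```python
import pandas as pd

# Hypothetical weekday/hour rows; the real columns are df['Weekday'] and df['Hour']
toy = pd.DataFrame({'Weekday': ['Wed', 'Mon', 'Sun', 'Mon', 'Fri'],
                    'Hour': [8, 8, 14, 17, 17]})

weekday_order = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']

# An ordered categorical makes groupby iterate in calendar order rather than alphabetically
toy['Weekday'] = pd.Categorical(toy['Weekday'], categories=weekday_order, ordered=True)

# observed=True skips weekdays that never occur in the data
group_order = [key for key, grp in toy.groupby('Weekday', observed=True)]
print(group_order)
```

Because the groups already arrive in Mon-Sun order, lines plotted in a loop over them would produce a legend in that order without indexing into the handles by position.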

Conclusion

There are some rush-hour patterns, and some heavily used areas affect the number of collisions, but no other distinct patterns were found in this set of explorations. The trend of collisions per year was interesting: something appears to have been done around 2018 that reduced the number of collisions, and we can see the effect that the Covid pandemic had on the number of collisions as well.

It was surprising to see that there was no easily-seen pattern of collisions regarding the year and month. I had thought for sure there would be a pattern of cold months, wet months, or even a pattern of kids being in-school vs out, but nothing stood out.

It was also interesting to find that most of the collisions that occurred were actually in dry conditions. Perhaps people are more cautious when it’s raining/snowing. In my own experience, this may be true. I was driving to work around 6:30AM - for some parts of the year, it is dark at this time. I observed that people tend to drive slower in the dark than in the sunshine.

Data Visualization: Python_Project

Ashley Kleen

2024-04-09