The 2022 Boston Crime dataset is sourced from Analyze Boston, the city’s open data hub. The crime incident reports are provided by the Boston Police Department and include details such as the offense description, the district in which the incident occurred, and the date and location of the incident. Using this real-world data, we can identify trends and patterns in Boston crime, pinpointing when and where the city is safest or most dangerous.
# imports
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import statistics
import numpy as np
# read in data
dtype_dict = {'INCIDENT_NUMBER': 'str'}
df = pd.read_csv('C:/Users/eshinh/DataVizFiles/2022.csv', dtype=dtype_dict)
df = df.drop(columns=['OFFENSE_CODE_GROUP', 'UCR_PART'], axis=1)
# fill in NA values
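# assign the result back; in-place fillna on a selected column is deprecated in recent pandas
df['DISTRICT'] = df['DISTRICT'].fillna("Not Available")
df['STREET'] = df['STREET'].fillna("Not Available")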
# format columns
df['OCCURRED_ON_DATE'] = pd.to_datetime(df['OCCURRED_ON_DATE'], format='%Y-%m-%d %H:%M:%S')
df['MONTH_NAME'] = df.OCCURRED_ON_DATE.dt.strftime('%b')
#-----------DESCRIPTIVE STATISTICS-------------
# Variables of interest: district, shooting, month, day_of_week, hour
mean = statistics.mean(df.DISTRICT.value_counts())
median = statistics.median(df.DISTRICT.value_counts())
mode = df.DISTRICT.value_counts().index[0]
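# index[-3] skips the two smallest categories, assumed here to be the 'External' and 'Not Available' placeholders
inverse_mode = df.DISTRICT.value_counts().index[-3]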
mode_day = df.DAY_OF_WEEK.value_counts().index[0]
inverse_mode_day = df.DAY_OF_WEEK.value_counts().index[-1]
mode_month = df.MONTH.value_counts().index[0]
inverse_mode_month = df.MONTH.value_counts().index[-1]
mode_hour = df.HOUR.value_counts().index[0]
inverse_mode_hour = df.HOUR.value_counts().index[-1]
The average number of incidents per district in Boston is approximately 5,275. The median number of incidents per district is 4,974.5.
Interpretation: Because the mean number of incidents is greater than the median, the distribution of district counts is skewed right: most districts sit at the lower end, while a few districts with high counts pull the mean upward. Incidents are therefore not spread symmetrically across Boston districts. Rather, they occur more frequently in certain districts.
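The mean-versus-median comparison above can be reproduced on a small example. The sketch below uses hypothetical per-district counts (not the 2022 data) and the same `statistics` module imported earlier. Note that a mean above the median is a quick heuristic for right skew, not a formal test.

```python
import statistics

# Hypothetical per-district incident counts (illustrative only, not the 2022 data).
counts = [1200, 3100, 4200, 4800, 5100, 6900, 11500]

mean_count = statistics.mean(counts)
median_count = statistics.median(counts)

# A mean above the median is a quick sign of a right (positive) skew:
# the single large value (11500) pulls the mean up but leaves the median alone.
skew_direction = "right" if mean_count > median_count else "left or symmetric"
print(round(mean_count, 1), median_count, skew_direction)
```

Here the one unusually large count raises the mean well above the median, which mirrors the pattern reported for the district counts.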
The Boston district that experiences the most incidents is B2. The Boston district that experiences the fewest incidents is A15.
Interpretation: District B2 is potentially the most dangerous district in Boston, while district A15 is potentially the safest.
The day of the week that experiences the most incidents is Friday. The day of the week that experiences the fewest incidents is Sunday.
Interpretation: Friday is the most active day of the week, which increases the risk of incidents. Sunday is the least active day of the week, which decreases the risk.
The month that experiences the most incidents is month 7. The month that experiences the fewest incidents is month 2.
Interpretation: Month 7 (July) is a summer month, which suggests that warmer temperatures increase overall activity and consequently the risk of incidents. Month 2 (February) is a winter month, which suggests that colder temperatures decrease overall activity and thus risk. February is also the shortest month (28 days in 2022), which lowers its total independently of the weather.
The hour that experiences the most incidents is hour 0. The hour that experiences the fewest incidents is hour 4.
Interpretation: Hour 0 (midnight) appears to be the most active hour of the day, contributing to the risk of incidents occurring. Hour 4 (4 AM) appears to be the least active hour of the day.
df['SHOOTING'].describe()
The mean of the SHOOTING column is 0.009925. Since “SHOOTING” is a binary variable, the mean is the share of incidents that involved a shooting: about 0.99% of 2022 Boston incidents.
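Why the mean of a binary column can be read as a proportion is easy to check on a toy example. The sketch below assumes a hypothetical series of eight 0/1 flags, not the real SHOOTING column.

```python
import pandas as pd

# Hypothetical 0/1 shooting flags for eight incidents (illustrative only).
shooting = pd.Series([0, 0, 1, 0, 0, 0, 0, 1], name="SHOOTING")

share = shooting.mean()                      # mean of a 0/1 column
same_share = shooting.sum() / len(shooting)  # flagged incidents / all incidents

print(f"{share:.2%}")  # prints 25.00%
```

Both expressions give the same number, because summing a 0/1 column counts the flagged rows.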
The visualizations below begin with a broad picture of how Boston incidents are distributed across the seasons, broken down further by month. The second visualization narrows the scope to the frequency of incidents on each day of the week, highlighting days that are more or less prone to incidents. The third applies the same analysis to shooting-related incidents, examining which districts and which days of the week see more shootings. The fourth explores the incident distribution in more depth, looking at the frequency of incidents across every hour of every day. The final visualization looks more closely at which hours of each day see more shooting-related incidents.
quarter_to_season = {'1': 'Winter', '2': 'Spring', '3': 'Summer', '4': 'Fall'}
df['SEASON'] = df.OCCURRED_ON_DATE.dt.quarter.astype('string').map(quarter_to_season)
pie_df = df.groupby(['SEASON', 'MONTH_NAME', 'MONTH']).agg({'SEASON':['count']}).reset_index()
pie_df.columns = ['Season', 'Month', 'Month_Number', 'Count']
pie_df.sort_values(by=['Month_Number'], inplace=True)
number_outside_colors = len(pie_df.Season.unique())
outside_color_ref_number = np.arange(number_outside_colors)*4
number_inside_colors = len(pie_df.Month.unique())
all_color_ref_number = np.arange(number_outside_colors + number_inside_colors)
inside_color_ref_number = []
for each in all_color_ref_number:
    if each not in outside_color_ref_number:
        inside_color_ref_number.append(each)
fig = plt.figure(figsize=(7,7))
ax = fig.add_subplot(1,1,1)
colormap = plt.get_cmap("tab20c")
outer_colors = colormap(outside_color_ref_number)
all_incidents = pie_df.Count.sum()
# sort=False keeps the seasons in month order (Winter, Spring, Summer, Fall)
# so the outer ring lines up with the inner ring of months
pie_df.groupby(['Season'], sort=False)['Count'].sum().plot(
    kind='pie', radius=1, colors=outer_colors, pctdistance=0.85, labeldistance=1.1,
    wedgeprops=dict(edgecolor='white'), textprops={'fontsize':13},
    autopct = lambda p: '{:.2f}%\n({:,.0f})'.format(p,(p/100)*all_incidents), startangle=90)
inner_colors = colormap(inside_color_ref_number)
pie_df.Count.plot(
    kind='pie', radius=0.7, colors=inner_colors, pctdistance=0.55, labeldistance=0.8,
    wedgeprops=dict(edgecolor='white'), textprops={'fontsize':11}, labels=pie_df.Month,
    autopct='%1.2f%%', startangle=90)
hole = plt.Circle((0,0), 0.3, fc='white')
fig1 = plt.gcf()
fig1.gca().add_artist(hole)
ax.yaxis.set_visible(False)
plt.title('Total Incidents by Season and Month', fontsize=14)
ax.text(0,0,'Total Incidents\n' + '{:,}'.format(all_incidents), ha='center' , va='center', fontsize=13)
ax.axis('equal')
plt.tight_layout()
plt.show()
The nested pie chart breaks down total incidents in Boston in 2022 by season and month. The outer ring shows the share of incidents in each of the four seasons (Winter, Spring, Summer, and Fall), and the inner ring divides each season into its months.
Summer accounts for the largest share of incidents, representing 26.59% of the total. This may suggest that warmer weather leads to increased outdoor activities, travel, and social events, which can contribute to a higher incidence rate.
Spring follows closely behind with 25.83% of incidents. The mild weather during this season may encourage more activities, leading to a moderate number of incidents.
Fall accounts for 24.45% of incidents, the third-largest share. The transition between warm and colder weather could play a role, with people spending more time outdoors before the weather turns.
Winter, with 23.13% of incidents, has the lowest share of total incidents, which could be attributed to fewer outdoor activities and a more subdued lifestyle during colder months. However, winter conditions such as snow or ice could also introduce risks that impact the types of incidents reported.
The month of July accounts for the highest number of incidents within the Summer season, with 9.11% of total incidents. This could be due to peak vacation times, outdoor festivals, or increased travel, leading to more incidents.
June and August show slightly lower but still significant percentages, with 8.8% and 8.95%, respectively. These months also likely coincide with the summer’s peak activity levels, which may explain their relatively high incident rates.
March accounts for 8.31% of incidents. The quarter-based mapping in the code groups March with Winter (January through March), so in the chart it belongs to the Winter slice rather than Spring.
December accounts for 7.93% of incidents and, under the same mapping, falls in the Fall slice (October through December). The descriptive statistics above identify February, not December, as the month with the fewest incidents.
The months display relatively consistent percentages of incidents throughout the year. Most months range between 7% and 9% of the total number of incidents, suggesting that incidents are distributed fairly evenly across the months, with only slight fluctuations.
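The season labels in the code above come from calendar quarters (`dt.quarter`), which puts March in Winter and December in Fall. A minimal sketch of an alternative, month-based grouping follows. It assumes meteorological seasons (December through February as Winter, and so on), and both the `month_to_season` dictionary and the sample dates are hypothetical and not part of the original analysis.

```python
import pandas as pd

# Assumed alternative grouping: meteorological seasons (Dec-Feb = Winter, etc.).
month_to_season = {
    12: "Winter", 1: "Winter", 2: "Winter",
    3: "Spring", 4: "Spring", 5: "Spring",
    6: "Summer", 7: "Summer", 8: "Summer",
    9: "Fall", 10: "Fall", 11: "Fall",
}

# Hypothetical timestamps standing in for OCCURRED_ON_DATE.
dates = pd.to_datetime(pd.Series(["2022-03-15", "2022-06-01", "2022-12-24", "2022-09-30"]))

# Quarter-based labels, as in the notebook, next to the month-based alternative.
by_quarter = dates.dt.quarter.map({1: "Winter", 2: "Spring", 3: "Summer", 4: "Fall"})
by_month = dates.dt.month.map(month_to_season)

comparison = pd.DataFrame({"date": dates, "quarter_based": by_quarter, "month_based": by_month})
print(comparison)
```

The two schemes disagree on March, June, September, and December, so the seasonal shares would shift somewhat depending on which definition is used.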
def pick_colors_according_to_mean_count(this_data):
    # red: more than 7% above the mean; green: more than 7% below; black: within 7%
    colors = []
    avg = this_data.Count.mean()
    for each in this_data.Count:
        if each > avg*1.07:
            colors.append('red')
        elif each < avg*0.93:
            colors.append('green')
        else:
            colors.append('black')
    return colors
y = df.groupby(['DAY_OF_WEEK']).agg({'DAY_OF_WEEK':['count']}).reset_index()
y.columns = ['Day of Week', 'Count']
y = y.sort_values('Count', ascending=True).reset_index()
my_colors3 = pick_colors_according_to_mean_count(y)
Above = mpatches.Patch(color='red', label='Above Average')
At = mpatches.Patch(color='black', label='Within 7% of the Average')
Below = mpatches.Patch(color='green', label='Below Average')
fig = plt.figure(figsize=(18,12))
ax1 = fig.add_subplot(1,1,1)
ax1.barh(y['Day of Week'], y.Count, color=my_colors3)
# label each bar with its count, colored the same way as the bar
for row_counter, value_at_row_counter in enumerate(y.Count):
    if value_at_row_counter > y.Count.mean()*1.07:
        color = 'red'
    elif value_at_row_counter < y.Count.mean()*0.93:
        color = 'green'
    else:
        color = 'black'
    ax1.text(value_at_row_counter+80, row_counter, str(value_at_row_counter),
             color=color, size=14, fontweight='bold', ha='left', va='center',
             backgroundcolor='white')
plt.xlim(0, y.Count.max()*1.1)
ax1.legend(loc='upper left', handles=[Above, At, Below], fontsize=18, bbox_to_anchor=(1, 1))
plt.axvline(y.Count.mean(), color='black', linestyle='dashed')
ax1.text(y.Count.mean()+100, 0, 'Mean = '+ str(round(y.Count.mean(),2)),
         rotation=0, fontsize=18)
ax1.spines['right'].set_visible(False)
ax1.spines['top'].set_visible(False)
ax1.set_title('Frequency of Incidents Analysis by Day', size=25)
ax1.set_xlabel('Incident Count', fontsize=20)
ax1.set_ylabel('Day of Week', fontsize=20)
plt.xticks(fontsize=18)
plt.yticks(fontsize=18)
plt.tight_layout()
plt.show()
The horizontal bar chart presented here shows incident data for Boston throughout the year 2022. It breaks down incidents by day of the week and compares each day's total to the mean across the seven days, approximately 10,550 incidents. The chart visually highlights which days fall above, below, or around the mean.
Friday stands out with a notably higher-than-average number of incidents. This could be due to increased activity as people go out for social events or engage in other activities on this day. Additionally, there may be an “end-of-week” effect, where accumulated stress or fatigue from the workweek leads to a higher likelihood of incidents occurring.
In contrast, Sunday shows a lower-than-average number of incidents. This is likely because there are fewer activities, with many people opting to stay home and avoid potentially risky situations. Culturally, Sundays are often seen as a day of rest and relaxation, which may contribute to the reduction in incidents.
The data for Monday through Thursday appear consistent with the average, suggesting no significant fluctuations in incident frequency during these weekdays. These days likely follow more predictable patterns of regular activities such as work and school, which don’t contribute to an abnormal number of incidents.
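The above/within/below classification used for the bar colors can also be written without an explicit loop. The sketch below is one possible vectorized version using `np.select`. The daily totals are hypothetical, and the `band` variable is simply a name introduced here for the 7% tolerance.

```python
import numpy as np
import pandas as pd

# Hypothetical daily totals (illustrative only, not the 2022 counts).
counts = pd.Series(
    [900, 1000, 1010, 990, 1020, 1180, 1000],
    index=["Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"],
)

avg = counts.mean()
band = 0.07  # same 7% tolerance as the chart above

# First matching condition wins; anything inside the band falls through to 'black'.
colors = np.select(
    [counts > avg * (1 + band), counts < avg * (1 - band)],
    ["red", "green"],
    default="black",
)
labelled = pd.Series(colors, index=counts.index)
print(labelled)
```

With these made-up numbers only Friday lands above the band and only Sunday below it, which is the same shape of result described for the real chart.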
bump_df = df.groupby(['DISTRICT', 'DAY_OF_WEEK'])['SHOOTING'].sum().reset_index(name='TotalShootings')
bump_df.columns = ['District', 'Weekday', 'TotalShootings']
bump_df = bump_df.pivot(index='District', columns='Weekday', values='TotalShootings')
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
bump_df = bump_df.reindex(columns=day_order)
bump_df_ranked = bump_df.rank(0, ascending=False, method='min')
bump_df_ranked = bump_df_ranked.T
bump_df_ranked = bump_df_ranked.drop(columns=['External', 'Not Available'], axis=1)
colors = ['red', 'orange', 'yellow', 'green', 'blue', 'purple', 'pink',
          'brown', 'magenta', 'black', 'cyan', 'lime']
fig = plt.figure(figsize=(18,10))
ax = fig.add_subplot(1,1,1)
bump_df_ranked.plot(kind='line', ax=ax, marker='o', markeredgewidth=1, linewidth=6,
                    markersize=44, markerfacecolor='white', color=colors)
ax.invert_yaxis()
num_rows = bump_df_ranked.shape[0]
num_cols = bump_df_ranked.shape[1]
plt.ylabel('Ranking', fontsize=18, labelpad=10)
plt.title('Ranking of Total Shootings by Day of Week and District \n Bump Chart', fontsize=18, pad=15)
plt.xticks(fontsize=14)
plt.yticks(range(1, num_cols+1, 1), fontsize=14)
ax.set_xlabel('Day Of Week', fontsize=18)
handles, labels = ax.get_legend_handles_labels()
ax.legend(handles, labels, bbox_to_anchor=(1.01, 1.01), fontsize=14, labelspacing=1, markerscale=0.4,
          borderpad=1, handletextpad=0.8)
i = 0
j = 0
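# i walks the days (rows of bump_df_ranked), j walks the districts (columns);
# each marker is labelled with the raw shooting total from bump_df
for eachcol in bump_df_ranked.columns:
    for eachrow in bump_df_ranked.index:
        this_rank = bump_df_ranked.iloc[i,j]
        ax.text(i, this_rank, str(bump_df.iloc[j,i]), ha='center', va='center', fontsize=12)
        i+=1
    j+=1
    i=0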
plt.show()
The bump chart illustrates the ranking of total shootings by day of the week and district in Boston for the year 2022. Each line represents a district, with the x-axis showing the days of the week, and the total number of shootings labeled for each district on each day.
Districts B3, B2, and C11 consistently rank among the top 3 for shootings across the days of the week, suggesting that these areas experienced a higher frequency of shootings throughout the year. These districts appear to be the most dangerous, with a relatively high number of incidents occurring consistently from day to day, particularly on Saturdays and Sundays.
Saturday and Sunday are the days with the most shootings across the districts, which may reflect increased social activities, gatherings, or other factors that elevate the likelihood of shootings during weekends. Districts like B3, B2, and C11, which rank high throughout the week, show even higher totals on these days.
District A15 ranks the lowest in shootings throughout the week, particularly from Tuesday to Friday, when it consistently reports 0 shootings. This positions A15 as the safest district in Boston during 2022, with very few shooting incidents overall.
Some districts that generally rank low in terms of shootings during the week show an increase in incidents on Sunday. This trend may indicate that, while these districts are relatively safer on weekdays, weekends—especially Sundays—see an uptick in violence. The dynamics of weekend activities may be a factor contributing to this rise.
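The rankings in the bump chart come from `DataFrame.rank` applied within each day. A minimal sketch on hypothetical totals (four districts, two days, values made up for illustration) shows how `ascending=False` and `method='min'` behave, including how ties are handled.

```python
import pandas as pd

# Hypothetical shooting totals per district and day (illustrative only).
totals = pd.DataFrame(
    {"Saturday": [9, 4, 4, 0], "Sunday": [7, 8, 1, 0]},
    index=["B2", "B3", "C11", "A15"],
)

# Rank within each day (column): 1 = most shootings.
# method='min' gives tied districts the same, lowest available rank.
ranked = totals.rank(axis=0, ascending=False, method="min").astype(int)
print(ranked)
```

On Saturday the two districts tied at 4 both receive rank 2 and the next district receives rank 4, so tied lines overlap on a bump chart and some rank positions can be left empty.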
import numpy as np
df_2 = df.groupby(['HOUR', 'DAY_OF_WEEK']).agg({'DAY_OF_WEEK':['count']}).reset_index()
df_2.columns = ['Hour', 'Weekday', 'Count']
fig = plt.figure(figsize=(18,10))
ax = fig.add_subplot(1,1,1)
my_colors = {'Monday':'blue',
             'Tuesday':'red',
             'Wednesday':'green',
             'Thursday':'gray',
             'Friday':'purple',
             'Saturday':'gold',
             'Sunday':'brown'}
# grouping by a one-element list yields one-element tuple keys in pandas 2.x, hence key[0]
for key, grp in df_2.groupby(['Weekday']):
    grp.plot(ax=ax, kind='line', x='Hour', y='Count', color=my_colors[key[0]], label=key[0], marker='8')
plt.title('Total Incidents by Hour', fontsize=18)
ax.set_xlabel('Hour (24 Hour Interval)', fontsize=18)
ax.set_ylabel('Total Incidents', fontsize=18, labelpad=20)
ax.tick_params(axis='x', labelsize=14, rotation=0)
ax.tick_params(axis='y', labelsize=16, rotation=0)
ax.set_xticks(np.arange(24))
handles, labels = ax.get_legend_handles_labels()
handles = [handles[1],handles[5],handles[6],handles[4],handles[0],handles[2],handles[3]]
labels = [labels[1],labels[5],labels[6],labels[4],labels[0],labels[2],labels[3]]
plt.legend(handles, labels, loc='best', fontsize=14, ncol=1)
plt.show()
The multiple line plot illustrates the total number of incidents recorded per hour in Boston in 2022, with each line representing a different day of the week. A few key trends emerge from the data:
For each day of the week, the number of incidents is highest at hour 0 (midnight). This suggests that incidents are most prevalent during this time, possibly due to increased nighttime activity, such as late-night events, social gatherings, or a higher risk of alcohol-related incidents.
Immediately after the peak at hour 0, there is a sharp drop in incidents at hour 1 for every day. This could indicate a transition as the late-night activities wind down and people either return home or move into less risky environments.
The data shows that hour 4 (early morning) has the lowest number of incidents for Monday through Friday, while hour 6 sees the fewest incidents on Saturday and Sunday. This could reflect a period of reduced activity or a transition to the morning routine, as people are either asleep or just beginning their day, and fewer risky behaviors are occurring at these times.
Starting from hour 6 in the morning, there is a general upward trend in the number of incidents for each day. This suggests that as the day progresses, the likelihood of incidents increases, possibly due to people becoming more active, returning to work, or starting social activities.
For each day, there is a slight increase in incidents around hour 12 (noon). This midday spike could be attributed to a variety of factors, including lunch breaks, mid-day fatigue, or the start of social activities as people take breaks from work or school, potentially increasing exposure to incidents.
After reaching a peak around hour 18 (6 PM), the number of incidents begins to decrease for each day. This decline may be related to a reduction in risky activities as people return home, settle into the evening, or start winding down for the day.
On Saturday and Sunday, there are fewer incidents from hour 6 to hour 19, indicating lower levels of activity during the day. This may suggest that people tend to engage in fewer high-risk activities or stay home more during the daytime on weekends compared to weekdays.
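The peak and trough hours read off the line plot can also be pulled out of an hour-by-day table directly. The sketch below is one way to do that with `pd.crosstab`, `idxmax`, and `idxmin`. The records are hypothetical and only the column names `HOUR` and `DAY_OF_WEEK` are taken from the dataset.

```python
import pandas as pd

# Hypothetical incident records (illustrative only).
records = pd.DataFrame({
    "DAY_OF_WEEK": ["Monday"] * 6 + ["Saturday"] * 7,
    "HOUR": [0, 0, 0, 4, 12, 12, 0, 0, 0, 0, 4, 4, 12],
})

# Rows = hours, columns = days, values = number of incidents.
table = pd.crosstab(records["HOUR"], records["DAY_OF_WEEK"])

busiest_hour = table.idxmax()   # hour with the most incidents, per day
quietest_hour = table.idxmin()  # hour with the fewest incidents, per day
print(busiest_hour)
print(quietest_hour)
```

Both methods return the first matching hour when several hours tie, so ties are worth checking before quoting a single quietest hour.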
stacked_df = df.groupby(['HOUR', 'DAY_OF_WEEK'])['SHOOTING'].sum().reset_index(name='TotalShootings')
stacked_df = stacked_df.pivot(index='HOUR', columns='DAY_OF_WEEK', values='TotalShootings')
stacked_df = stacked_df.reindex(columns=reversed(day_order))
fig = plt.figure(figsize=(15, 8))
ax = fig.add_subplot(1,1,1)
stacked_df.plot(kind='bar', stacked=True, ax=ax)
plt.ylabel('Total Shootings', fontsize=18, labelpad=10)
plt.title('Total Shootings by Hour and by Day \n Stacked Bar Plot', fontsize=18)
ax.set_xlabel('Hour (24 Hour Interval)', fontsize=18)
plt.xticks(rotation=0, horizontalalignment='center', fontsize=14)
plt.yticks(fontsize=14)
handles, labels = ax.get_legend_handles_labels()
handles = [handles[6], handles[5], handles[4], handles[3], handles[2], handles[1], handles[0]]
labels = [labels[6], labels[5], labels[4], labels[3], labels[2], labels[1], labels[0]]
plt.legend(handles, labels, loc='best', fontsize=14)
plt.show()
The stacked bar plot provides a detailed breakdown of total shootings by hour of the day and day of the week. The x-axis represents the hours of the day, from hour 0 (midnight) to hour 23 (11 PM), and the y-axis shows the total number of shootings. Each bar is stacked and color-coded by day of the week, showcasing the distribution of shootings across the days.
Hour 8 (8 AM) reports the fewest shootings, with fewer than 5 recorded, and those occurred only on Saturday and Sunday. This suggests that shootings are unlikely early in the morning on weekdays, possibly due to reduced activity during the transition from nighttime to daytime hours.
The most shootings occur at hour 0 (midnight) and hour 22 (10 PM). These late-night hours show the most incidents, especially on weekends, which account for a greater share of shootings than weekdays. Hour 0 (midnight) likely reflects a time of heightened activity, such as late-night socializing, bar closing times, and possibly alcohol or drug-related incidents. Similarly, hour 22 (10 PM) could coincide with people being out later in the evening, social events, or other situations that may increase the likelihood of violent incidents.
After the peak at hour 0, the number of shootings gradually declines through the early morning hours, reaching its lowest point at hour 8. This could reflect a natural period of calmness as people sleep or engage in less risky activities during the early hours of the morning.
Starting at hour 9 (9 AM), the number of shootings begins to rise again, although it remains lower than the night-time peaks. This increase continues throughout the day, reaching its next peak at hour 22. This gradual climb may reflect the build-up of activity and the eventual return of socializing or late-night events, contributing to the heightened number of shootings as the evening progresses.
While there are no days with shootings recorded during every hour of the day, the trend indicates that weekends (particularly Saturday and Sunday) experience more incidents during the late-night and early-morning hours. In contrast, Wednesdays and Thursdays consistently show lower numbers of shootings at all hours, suggesting that these days may see fewer incidents overall, possibly due to lower social activity or different patterns of behavior during the midweek.
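The weekend-versus-weekday comparison can be quantified from the same hour-by-day table that feeds the stacked bars. The sketch below is a minimal illustration on hypothetical records. The `weekend_share` name is introduced here for illustration, and the result says nothing about the actual 2022 figures.

```python
import pandas as pd

# Hypothetical records with a 0/1 SHOOTING flag (illustrative only).
records = pd.DataFrame({
    "DAY_OF_WEEK": ["Saturday", "Saturday", "Sunday", "Wednesday", "Friday", "Sunday"],
    "HOUR":        [0, 22, 0, 22, 8, 8],
    "SHOOTING":    [1, 1, 1, 1, 0, 0],
})

# Rows = hours, columns = days, values = total shootings (0 where none recorded).
by_hour_day = records.pivot_table(
    index="HOUR", columns="DAY_OF_WEEK", values="SHOOTING", aggfunc="sum", fill_value=0
)

per_hour = by_hour_day.sum(axis=1)  # total shootings per hour, all days combined
weekend_share = by_hour_day[["Saturday", "Sunday"]].sum().sum() / per_hour.sum()
print(per_hour)
print(f"{weekend_share:.0%}")
```

Because weekends cover two of seven days, a weekend share well above roughly 29% would indicate that shootings are concentrated on weekends.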
The graphs suggest a relationship between activity and incidents. Seasons, months, days, and hours that are generally more active appear more susceptible to incidents. Narrowing the analysis to shooting-related incidents suggests that this pattern holds for violent incidents as well as for incidents in general.