import os
# Point Qt to Anaconda's platform plugins so the Qt-based plotting backend can start on this machine
os.environ['QT_QPA_PLATFORM_PLUGIN_PATH'] = 'c:/ProgramData/Anaconda3/Library/plugins/platforms'

Introduction

This is my second analysis of a data set, but instead of using RStudio, I executed Python code in Jupyter Notebook. This publication is part of my “Data Visualization” class at Loyola University Maryland. The graphs/visualizations created pertain to a data set on Baltimore Crimes. I hope you are able to gain some useful insight from my graphs, and I hope they inform you of which areas call for caution if you ever travel to Baltimore.

Dataset

The data set that I worked with for this analysis is a Baltimore Crimes data set. This data set actually includes crime entries from the '60s, '70s, '80s, '90s, and early 2000s. I found this very interesting and was excited to include it in my analysis, but upon further investigation I found that most of those years had only one entry. When I began manipulating the data in Jupyter Notebook, I excluded all of the years with only a few entries and focused my analysis on the years 2014-2020, since these years contain the majority of the entries. These years still account for over 300,000 records.

The main columns that I worked with in this data set are “CrimeDate”, “CrimeTime”, “Description”, and “Neighborhood”. Naturally, like many others, I wanted to determine which crimes were the most common and which neighborhoods accounted for the most crimes. This data set includes almost every type of crime imaginable, including assault, auto theft, burglary, homicide, shooting, and arson. By using the “CrimeDate” and “CrimeTime” columns, I was able to pull information such as the hour, day, month, and year each crime was committed. This was useful because many of my graphs focus on when the crimes were committed, and I was able to discover which hours, days, and months saw the highest volume of crime.

The data set did come with the longitude and latitude coordinates of where each crime was committed. With this information I was able to create a map of Baltimore with little circles marking the spot where each crime occurred. Unfortunately, I am not able to publish this map onto RPubs. If you would like to view this map, please contact me by email.
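
Since the original map cannot be shown here, below is a minimal sketch of how a similar static view could be drawn from the same file. This is only an illustration, not the map described above; it assumes the coordinate columns are named “Latitude” and “Longitude” and that rows with missing coordinates can simply be dropped.

import pandas as pd
import matplotlib.pyplot as plt

# Sketch only: the column names 'Latitude' and 'Longitude' are assumed here
coords = pd.read_csv("U:/BPD_Part_1_Victim_Based_Crime_Data.csv",
                     usecols=['Latitude', 'Longitude']).dropna()

fig, ax = plt.subplots(figsize=(10, 10))
# One small, semi-transparent dot per crime; busier areas show up darker
ax.scatter(coords['Longitude'], coords['Latitude'], s=1, alpha=0.1, color='darkred')
ax.set_title('Approximate Locations of Baltimore Crimes')
ax.set_xlabel('Longitude')
ax.set_ylabel('Latitude')
plt.show()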

Findings

Here is some general information regarding my findings before you view the individual charts in their respective tabs.

  • A majority of the crimes committed are larceny (theft of personal property), assault, and burglary.
  • A majority of the crimes were committed in the Downtown area, with several other neighborhoods receiving above-average counts. Downtown likely leads because of the higher volume of people constantly travelling through it.
  • A majority of the crimes were committed later in the day, with counts bottoming out in the early morning hours.
  • There is a higher volume of crime during the warmer months and seasons, and a decline in crime activity during the colder months and seasons.

The findings for each graph will be explained in detail in their respective tabs.

NOTICE - Some visualizations may require you to zoom in; they were difficult to format in RMarkdown after being brought over as Python code from Jupyter Notebook.

Bar Chart

The first visualization is a bar chart that displays the top 10 crimes committed from 2014-2020. The type of crime is on the x axis, while the total count of each crime is on the y axis. I constructed this graph solely to see which crimes were the most common during the time period that this data set represents. The top 3 crimes are larceny, common assault, and burglary. What makes this bar chart different from a regular bar chart is the dashed line that represents the mean number of crimes committed. There is also a legend to accompany the dashed mean line, which indicates whether a certain crime is above or below average. Any crime above the average is colored red, and any crime below the average is colored green. If any crime had been within 1% of the average, its bar would have been colored black. The below-average crimes include several different types of robberies, as well as shootings.


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings

warnings.filterwarnings("ignore")

path = "U:/"

filename = path + "BPD_Part_1_Victim_Based_Crime_Data.csv"

# Peek at the first few rows, then load only the columns used in the analysis
df = pd.read_csv(filename, nrows=5)
df = pd.read_csv(filename, usecols = ['CrimeDate', 'CrimeTime', 'Neighborhood', 'Description'])

# Parse the date and time columns and derive weekday, month, year, and hour fields
df['CrimeDate'] = pd.to_datetime(df['CrimeDate'], format = '%m/%d/%Y')
df['CrimeTime'] = pd.to_datetime(df['CrimeTime'])
df['Weekday'] = df.CrimeDate.dt.strftime('%a')
df['MonthName'] = df.CrimeDate.dt.strftime('%b')
df['Year'] = df.CrimeDate.dt.year

# Keep 2014 onward; earlier years have only a handful of entries
df = df[df['Year'] >= 2014]
df['Hour'] = df.CrimeTime.dt.hour

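# Count crimes by type (Description) and sort from most to least common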
x = df.groupby(['Description']).agg({'Description':['count']}).reset_index()
x.columns = ['Description', 'Count']
x = x.sort_values('Count', ascending=False)
x.reset_index(inplace=True, drop=True)

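# Choose a bar color based on how each count compares to the mean:
# red (lightcoral) above average, green below, black within 1% of the mean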
def pick_colors_according_to_mean_count(this_data):
    colors=[]
    avg = this_data.Count.mean()
    for each in this_data.Count:
        if each > avg*1.01:
            colors.append('lightcoral')
        elif each < avg*0.99:
            colors.append('green')
        else:
            colors.append('black')
    return colors 
  
import matplotlib.patches as mpatches

bottom1 = 0
top1 = 9
d1 = x.loc[bottom1:top1]
my_colors1 = pick_colors_according_to_mean_count(d1)

Above = mpatches.Patch(color='lightcoral', label='Above Average')
At = mpatches.Patch(color='black', label='Within 1% of Average')
Below = mpatches.Patch(color='green', label='Below Average')

fig = plt.figure(figsize=(24, 12))

ax1 = fig.add_subplot(1, 1, 1)
ax1.bar(d1.Description, d1.Count, label='Crime Count', color=my_colors1)
#ax1.legend(fontsize=14)
ax1.legend(handles=[Above, At, Below], fontsize=18)
plt.axhline(d1.Count.mean(), color='black', linestyle='dashed')
ax1.spines['right'].set_visible(False)
ax1.spines['top'].set_visible(False)
ax1.set_title('Top 10 Baltimore Crimes from 2014-2020', size=30)
ax1.text(top1-1, d1.Count.mean()+1000, 'Mean = ' + str(d1.Count.mean()), rotation=0, fontsize=20)
plt.xticks(fontsize=11)
plt.yticks(fontsize=16)
ax1.set_xlabel('Type of Crime', fontsize=24, labelpad=20)
ax1.set_ylabel('Crime Count', fontsize=24, labelpad=20)
plt.show()

Horizontal Bar Chart

The next visualization is another bar chart, except that it is horizontal and looks at the neighborhoods where the most crimes were committed. The neighborhood is on the y axis, while the total crime count is on the x axis. Like the other bar chart, this graph has a line that represents the mean number of crimes committed per neighborhood. There is also an accompanying legend: neighborhoods with an above-average crime count have their bar colored red, while neighborhoods with a below-average crime count have their bar colored green. If any bar had a crime count within 1% of the average, it would be colored black. The color coding makes it easy to see which neighborhoods or areas can be considered crime hot zones. Each bar is also labeled, to its right, with the neighborhood's total crime count from 2014-2020. The neighborhoods with above-average crime counts are Downtown, Frankford, Belair-Edison, and Brooklyn. Downtown has a much higher number of crimes than any other neighborhood, likely because of its sheer size and density.


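# Count crimes by neighborhood and sort from most to least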
x2 = df.groupby(['Neighborhood']).agg({'Neighborhood':['count']}).reset_index()
x2.columns = ['Neighborhood', 'Count']
x2 = x2.sort_values('Count', ascending=False)
x2.reset_index(inplace=True, drop=True)

import matplotlib.patches as mpatches

bottom2 = 0
top2 = 14
d2 = x2.loc[bottom2:top2]
d2 = d2.sort_values('Count', ascending=True)
d2.reset_index(inplace=True, drop=True)
my_colors2 = pick_colors_according_to_mean_count(d2)

Above = mpatches.Patch(color='lightcoral', label='Above Average')
At = mpatches.Patch(color='black', label='Within 1% of Average')
Below = mpatches.Patch(color='green', label='Below Average')

fig = plt.figure(figsize=(38, 20))
ax1 = fig.add_subplot(1, 1, 1)
ax1.barh(d2.Neighborhood, d2.Count, color=my_colors2)
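# Write each neighborhood's total to the right of its bar, colored to match the above/below-average scheme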
for row_counter, value_at_row_counter in enumerate(d2.Count):
    if value_at_row_counter > d2.Count.mean()*1.01:
        color = 'lightcoral'
    elif value_at_row_counter < d2.Count.mean()*0.99:
        color = 'green'
    else:
        color = 'black'
    ax1.text(value_at_row_counter+100, row_counter, str(value_at_row_counter), color=color, size=28, fontweight='bold',
            ha='left', va='center', backgroundcolor='white')
plt.xlim(0, d2.Count.max()*1.1)
    
ax1.legend(loc='lower right', handles=[Above, At, Below], fontsize=30)
plt.axvline(d2.Count.mean(), color='black', linestyle='solid')
ax1.text(d2.Count.mean()+100, 0, 'Mean = ' + str(d2.Count.mean()), rotation=0, fontsize=34)
ax1.set_title('Top 15 Crime Neighborhoods in Baltimore (2014-2020)', size=45)
ax1.set_xlabel('Crime Count', fontsize=36)
ax1.set_ylabel('Neighborhood', fontsize=30)
plt.xticks(fontsize=30)
plt.yticks(fontsize=24)
plt.show()

Line Graph

This visualization is a line graph that analyzes the number of crimes committed by hour of the day and by day of the week. The hour of the day is on the x axis (on a 24-hour scale), and the total number of crimes committed is on the y axis. Each day of the week is represented by a different colored line on the graph, which can be referenced in the legend.

I created this visualization in an attempt to discover whether there were any trends or upticks in crime during certain days of the week or hours of the day. From analyzing the graph, Saturday and Sunday are the leaders in total crimes from about midnight until 5 am. After 5 am, the weekdays have very similar crime totals to one another, and higher counts than Saturday and Sunday. What is very interesting to see in this visualization is that each day of the week follows almost the same pattern in total crimes committed by hour of the day. Crime is relatively high at midnight (hour 0) for each day, then goes into a steep decline until around 5 or 6 am. It then increases for each day until around noon (hour 12), when all days of the week see a sudden spike. This was interesting to see, and the increase could be attributed to most people being awake and active at this point in the day. The crime count for each day continues to increase and reaches its peak around 6 pm (hour 18), before beginning a steady decline as the night goes on. I assumed that a majority of crimes would be committed later in the day or at night, and it was very interesting to see which hours were responsible for the most crimes according to the graph.


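# Total crimes for each hour-of-day / day-of-week combination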
line_df = df.groupby(['Hour', 'Weekday'])['Description'].count().reset_index(name='CrimeCount')

from matplotlib.ticker import FuncFormatter

fig = plt.figure(figsize = (22,16))
ax = fig.add_subplot(1, 1, 1)

my_colors = {'Mon':'blue',
             'Tue':'red',
             'Wed':'green',
             'Thu':'gray',
             'Fri':'purple',
             'Sat':'gold',
             'Sun':'brown'}

# One line per weekday; grouping by the column name (not a list) keeps each key a plain string
for key, grp in line_df.groupby('Weekday'):
    grp.plot(ax=ax, kind='line', x='Hour', y = 'CrimeCount', color=my_colors[key], label=key, marker='8')
plt.title('Line Graph of Total Crimes by Hour and by Day', fontsize=30)
ax.set_xlabel('Hour of the Day (24 Hour Interval)', fontsize=28, labelpad=20)
ax.set_ylabel('Total Crimes Committed', fontsize=28, labelpad=20)
ax.tick_params(axis='x', labelsize=20, rotation=0)
ax.tick_params(axis='y', labelsize=20, rotation=0)

ax.set_xticks(np.arange(24))
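# Reorder the legend from the alphabetical order produced by groupby (Fri, Mon, Sat, ...) to Mon-Sun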
handles, labels = ax.get_legend_handles_labels()
handles = [ handles[1], handles[5], handles[6], handles[4], handles[0], handles[2], handles[3] ]
labels  = [  labels[1],  labels[5],  labels[6],  labels[4],  labels[0],  labels[2],  labels[3] ]
plt.legend(handles, labels, loc='lower right', fontsize=22, ncol=1)
ax.get_yaxis().set_major_formatter(FuncFormatter(lambda x, p: format(int(x), ',')))   # thousands separators on the y axis

plt.show()

Nested Pie Chart

The next visualization that I created is a nested pie chart. This is my favorite visualization that I created, due to the amount of information that can be derived from it. I began by creating a pie chart with four slices, one for each quarter of the year. My goal in creating this visualization was to discover which quarters and which months were responsible for the most crimes. I was wondering if there would be an increase in crime during the summer months, and/or a decrease during the winter months due to outdoor temperatures and weather conditions.

The first layer of this nested pie chart is split into four sections, one for each quarter of the year. The percentage of total crimes and the total crime count are visible for each quarter. Quarter 2 and Quarter 3 represent a higher percentage of the total than Quarters 1 and 4. The largest difference is between Quarter 3 and Quarter 1, a gap of about 16,000 crimes. This could be a result of more people being active in the summer months than in the winter months, leading to more crimes being committed.

The next layer of the nested pie chart shows the individual months that make up each quarter. The visualization displays each month in the second layer along with its percentage of total crimes for the period 2014-2020. This allows us to take a deeper dive into which months, in addition to which quarters, were responsible for the most crimes. August is responsible for the highest percentage of crimes, while February is responsible for the lowest. This supports the idea that the seasons have an effect on the volume of crime.

Finally, the total number of crimes committed from 2014-2020 in this data set is shown in the center of the nested pie chart. This makes it easier to see what share of the total each quarter and month is responsible for.


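# Total crimes per quarter and month; the numeric month is kept temporarily so the months sort in calendar order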
df['Month'] = df.CrimeDate.dt.month
df['Quarter'] = 'Quarter ' + df.CrimeDate.dt.quarter.astype('string')
pie_df = df.groupby(['Quarter', 'MonthName', 'Month'])['Description'].count().reset_index(name='CrimeCount')
pie_df.sort_values(by=['Month'], inplace=True)
pie_df.reset_index(inplace=True, drop=True)
del pie_df['Month']

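# Pick colors from the tab20c colormap: every 4th color (one shade family per quarter) for the
# outer ring of quarters, and the remaining shades for the inner ring of months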
number_outside_colors = len(pie_df.Quarter.unique())
outside_color_ref_number = np.arange(number_outside_colors)*4

number_inside_colors = len(pie_df.MonthName.unique())
all_color_ref_number = np.arange(number_outside_colors + number_inside_colors)

inside_color_ref_number = []
for each in all_color_ref_number:
    if each not in outside_color_ref_number:
        inside_color_ref_number.append(each)

fig = plt.figure(figsize=(7,7))
ax = fig.add_subplot(1, 1, 1)

colormap = plt.get_cmap("tab20c")
outer_colors = colormap(outside_color_ref_number)

all_crimes = pie_df.CrimeCount.sum()

pie_df.groupby(['Quarter'])['CrimeCount'].sum().plot(
       kind='pie', radius=1, colors = outer_colors, pctdistance = 0.85, labeldistance = 1.1,
       wedgeprops = dict(edgecolor='White'), textprops= {'fontsize':13},
       autopct = lambda p: '{:.2f}%\n({:,.0f})'.format(p,(p/100*all_crimes)), 
       startangle=90)
inner_colors = colormap(inside_color_ref_number)
pie_df.CrimeCount.plot(
       kind='pie', radius=0.7, colors = inner_colors, pctdistance = 0.55, labeldistance = 0.8,
       wedgeprops = dict(edgecolor='White'), textprops= {'fontsize':11},
       labels = pie_df.MonthName, 
       autopct = '%1.2f%%', 
       startangle=90)
hole = plt.Circle((0,0), 0.3, fc='white')
fig1 = plt.gcf()
fig1.gca().add_artist(hole)
ax.yaxis.set_visible(False)
plt.title('Total Crimes by Quarter and Month', fontsize=14)
ax.text(0, 0, 'Total Crimes\n' + "{:,}".format(all_crimes), size=13, ha='center', va='center')
ax.axis('equal')
plt.tight_layout()

plt.show()

Bump Chart

This visualization is a bump chart, and it focuses on how each year ranks within each month of the year. It includes the years 2014-2019. The year 2020 is not included, since the data set only contains data up until November of 2020. Each month of the year is on the x axis, and the monthly ranking for each year is on the y axis. The ranking runs from 1 to 6, since there are 6 years included in this visualization. This chart helps us determine which years were responsible for the most crimes. The year 2017 is one of the leaders in total crimes committed, leading in the months January through June. The year 2014 is responsible for one of the lowest crime totals of all the years, as it ranks toward the bottom for many of the months. What is unique about this graph is that you can easily follow the line for each year to see where it ranks in each month. Another unique aspect of this visualization is that each year has a bubble for each month showing the total number of crimes committed in that year and month. By looking at the leading number of crimes for each month, you can get an idea of which months of the year were responsible for the most and the fewest crimes. Overall, this bump chart is an effective way to see which years led in total crimes, both month by month and overall.


bump_df = df.groupby(['Year', 'MonthName'])['Description'].count().reset_index(name='TotalCrimes')

bump_df = bump_df.pivot(index='Year', columns='MonthName', values = 'TotalCrimes')

month_order = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

bump_df = bump_df.reindex(columns=month_order)

bump_df = bump_df.dropna()

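# Rank the years within each month (axis 0); rank 1 = the year with the most crimes in that month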
bump_df_ranked = bump_df.rank(0, ascending=False, method='min')

bump_df_ranked = bump_df_ranked.T

fig = plt.figure(figsize=(22,16))
ax = fig.add_subplot(1, 1, 1)

bump_df_ranked.plot(kind='line', ax=ax, marker='o', markeredgewidth=1, linewidth=6,
                   markersize=54,
                   markerfacecolor='white')
ax.invert_yaxis()

num_rows = bump_df_ranked.shape[0]
num_cols = bump_df_ranked.shape[1]

plt.ylabel('Monthly Ranking', fontsize=24, labelpad=10)
plt.title('Ranking of Total Crimes by Month and by Year (2014-2019) \n Bump Chart', fontsize=28, pad=15)
plt.xticks(np.arange(num_rows), month_order, fontsize=18)
plt.yticks(range(1, num_cols+1, 1), fontsize=18)
ax.set_xlabel('Month', fontsize=24)
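# Reverse the legend order so the most recent year (2019) is listed first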
handles, labels = ax.get_legend_handles_labels()
handles = [ handles[5], handles[4], handles[3], handles[2], handles[1], handles[0] ]
labels  = [  labels[5],  labels[4],  labels[3],  labels[2],  labels[1],  labels[0] ]
ax.legend(handles, labels, bbox_to_anchor=(1.01, 1.01), fontsize=20,
         labelspacing = 1,
         markerscale = .4, 
         borderpad = 1, 
         handletextpad = 0.8)
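# Annotate each marker with the raw number of crimes for that year and month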
i = 0
j = 0
for eachcol in bump_df_ranked.columns:
    for eachrow in bump_df_ranked.index:
        this_rank = bump_df_ranked.iloc[i, j]
        ax.text(i, this_rank, str(int(bump_df.iloc[j, i])), ha='center', va='center', fontsize=17)
        i+=1
    j+=1
    i=0
plt.show()

Heatmap

The final visualization is a heatmap. This heatmap is essentially a different way to look at and analyze the information that was presented in the line graph. On the x axis of the heatmap is the hour of the day (on a 24-hour scale), and on the y axis is the day of the week. In the line graph you can analyze the peaks and dips in the lines to see when most of the crimes were committed, whereas in the heatmap you can look at the colors. Blue represents a low number of crimes, while darker red represents a higher number, as shown by the color bar on the right side of the visualization. It is clearly visible that the first few hours of the day have more blue squares, signifying a low amount of criminal activity. As the day goes on, the boxes turn a deeper shade of red, indicating a higher amount of criminal activity.

My apologies for this graph being slightly difficult to see; hopefully zooming in helps you make out the numbers. The blue and red colors are the main point of this visualization and can still be used to understand what it represents.


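# Total crimes for each hour/weekday combination, reshaped into a weekday-by-hour grid for the heatmap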
hm_df = df.groupby(['Hour', 'Weekday'])['Description'].count().reset_index(name='CrimeCount')

hm_df = hm_df.pivot(index='Weekday', columns='Hour', values='CrimeCount')

day_order = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']

hm_df = hm_df.reindex(index=reversed(day_order))

import seaborn as sns
from matplotlib.ticker import FuncFormatter

fig = plt.figure(figsize=(20, 6))
ax = fig.add_subplot(1, 1, 1)

comma_fmt = FuncFormatter(lambda x, p: format(int(x), ','))

ax = sns.heatmap(hm_df, linewidth = 0.2, annot = True, cmap = 'coolwarm', fmt=',.0f',
                 square = True, annot_kws={'size': 12}, 
                 cbar_kws = {'format': comma_fmt, 'orientation':'vertical'})

plt.title('Heatmap of the Number of Crimes by Day and by Hour', fontsize=26, pad=15)
plt.xlabel('Hour of the Day (24 Hour Interval)', fontsize=20, labelpad=10)
plt.ylabel('Day of the Week', fontsize=20, labelpad=10)
plt.yticks(rotation=0, size=16)
plt.xticks(size=14)
cbar = ax.collections[0].colorbar

max_count = hm_df.to_numpy().max()

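# Show a colorbar tick every 250 crimes, starting at 500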
my_colorbar_ticks = [*range(500, max_count, 250)]
cbar.set_ticks(my_colorbar_ticks)

my_colorbar_tick_labels = ['{:,}'.format(each) for each in my_colorbar_ticks]
cbar.set_ticklabels(my_colorbar_tick_labels)

cbar.set_label('Number of Crimes', rotation = 270, fontsize=20, color='black', labelpad=30)

plt.show()

Conclusion

We are finished analyzing the charts. Here are some general takeaways from my output:

  • The crimes that are most common in this time period are the less complicated ones (larceny, assault, and burglary). There are far fewer robberies, shootings, and other major crimes than I imagined there would be before analyzing this data set.
  • The area with the highest crime is Downtown, which makes sense since there is a larger volume of people in that area. Frankford and Belair-Edison are the areas with the next highest crime activity, and they are located very close to each other. Brooklyn is another area with above-average crime activity, and it is located south of Downtown.
  • Throughout the day, crime is at a low in the early hours of the morning, slowly increases throughout the day, and peaks at around dinnertime. The activity begins to slowly decrease after this peak, but still remains higher than during most other hours of the day. Obviously fewer crimes are committed when everyone is sleeping. There is a sharp increase in activity around noon, which was very interesting to see.
  • From 2014-2020, a majority of the crimes were committed during the middle of the year, especially the summer months. Fewer crimes were committed during the beginning and end of the year, especially during the cold winter months.
  • 2017 seemed to be the leading year for crimes committed, while 2014 seemed to be the year with the fewest.

Overall, 2014-2020 is a wide range with a large amount of data collected on crimes in Baltimore. The patterns and trends discovered from these graphs can likely be extrapolated to future months and years. If you are in Baltimore, keep the information from these graphs in mind and stay safe!

Thank you for checking out my visualizations!