Analysis of Crime Reports from Chicago 2001-2020

The dataset I have chosen to analyze draws on police calls from the City of Chicago between 2001 and 2020. It has over 7 million rows and is organized into twenty-two variables. It covers a wide range of incidents across the twenty-year span and includes several kinds of data, including the date and time each call came in, the district of Chicago where the incident occurred, and the type of setting where it took place.

import os
os.environ['QT_QPA_PLATFORM_PLUGIN_PATH'] = 'c:/users/omalx/anaconda3/Library/plugins/platforms'

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings

warnings.filterwarnings("ignore")

path = "C:/Users/omalx/Documents/IS 460 Data Visualization/Python Work/Chicago Crime/"

filename = path + 'Crimes 2001 to Present.csv'
df = pd.read_csv(filename, usecols=['Date', 'Primary Type', 'Description', 'District', 'Arrest', 'Location Description'])
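# Assign the result back; calling fillna(inplace=True) on a selected column may not update df in newer pandas versions
df['Location Description'] = df['Location Description'].fillna("Not Available")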
df['Date'] = pd.to_datetime(df['Date'], format = '%m/%d/%Y %I:%M:%S %p')
df['Hour'] = df.Date.dt.hour
df['Day'] = df.Date.dt.day
df['Month'] = df.Date.dt.month
df['Year'] = df.Date.dt.year
df['Weekday'] = df.Date.dt.strftime('%a')
df['MonthName'] = df.Date.dt.strftime('%b')
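
As a quick check of the date handling above, here is a minimal sketch that parses two made-up timestamps with the same format string and derives the same helper columns. The `sample` frame and its values are hypothetical and are not taken from the Chicago file.

```python
import pandas as pd

# Hypothetical sample timestamps laid out the way the format string expects
sample = pd.DataFrame({'Date': ['01/05/2015 11:30:00 PM',
                                '07/04/2012 12:15:00 AM']})

# %I is the 12-hour clock and %p is the AM/PM marker
sample['Date'] = pd.to_datetime(sample['Date'], format='%m/%d/%Y %I:%M:%S %p')

# Derive the same helper columns used in the analysis
sample['Hour'] = sample['Date'].dt.hour          # 0 to 23
sample['Month'] = sample['Date'].dt.month        # 1 to 12
sample['Year'] = sample['Date'].dt.year
sample['Weekday'] = sample['Date'].dt.strftime('%a')    # abbreviated day name
sample['MonthName'] = sample['Date'].dt.strftime('%b')  # abbreviated month name

print(sample)
```

The 11:30 PM entry should land in hour 23 and the 12:15 AM entry in hour 0, which is the convention the hourly line chart later relies on.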

Vertical Bar Chart of Top Crimes Called In

This graph is a vertical bar chart showing the eight most frequent types of calls made to police across the entire dataset. Each bar covers one type of crime, such as arson or assault, and the eight most frequent are about what would be expected. An interesting aspect of the graph is the “Other Offense” bar: even though the list of offense types is already very broad, including battery, gambling, interference with a police officer, and many more, a catch-all category still ranks among the top eight. It likely collects rarer offenses such as failing to appear in court. Theft and battery being the most common offenses is understandable, since those tend to be crimes of opportunity or circumstance and are the easiest to carry out.

x = df.groupby(['Primary Type']).agg({'Primary Type':['count']}).reset_index()
x.columns = ['OffenseType', 'Count']
x = x.sort_values('Count', ascending=False)
x.reset_index(inplace=True, drop=True)

from matplotlib.ticker import FuncFormatter
bottom1 = 0
top1 = 7
d1 = x.loc[bottom1:top1]
fig=plt.figure(figsize=(18,10))
ax1 = fig.add_subplot(1, 1, 1)
ax1.bar(d1.OffenseType, d1.Count, label='Report Count', color='slateblue')
ax1.legend(loc='upper right', fontsize=14)
ax1.spines['right'].set_visible(False)
ax1.spines['top'].set_visible(False)
ax1.set_ylabel('Reports (in millions)', fontsize=16)
ax1.set_xlabel('Offense Type', fontsize=16)
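# .loc[bottom1:top1] is inclusive, so d1 holds top1 + 1 rows; use len(d1) so the title matches the bars shown
ax1.set_title('Top ' + str(len(d1)) + ' Incidents Called in to Police by Offense Type', size=20)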
ax1.yaxis.set_major_formatter(FuncFormatter(lambda z, pos:('%1.1fM')%(z*1e-6)))
plt.show()
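
The count-sort-slice steps above can also be written with `value_counts`, which returns counts already sorted in descending order. The sketch below uses a small made-up frame (`toy`) in place of `df`, and `top_n` is a hypothetical name. One thing it makes explicit is the number of rows kept: `.loc[0:7]` is label-based and inclusive, so it returns eight rows, whereas `.head(n)` returns exactly `n`.

```python
import pandas as pd

# Hypothetical toy data standing in for df['Primary Type']
toy = pd.DataFrame({'Primary Type': ['THEFT', 'BATTERY', 'THEFT', 'ARSON',
                                     'THEFT', 'BATTERY', 'NARCOTICS']})

top_n = 2  # number of categories to keep

top_offenses = (toy['Primary Type']
                .value_counts()              # counts, largest first
                .head(top_n)                 # keep exactly top_n rows
                .rename_axis('OffenseType')  # name the index before resetting it
                .reset_index(name='Count'))  # back to a two-column frame

print(top_offenses)
```

The resulting frame has the same `OffenseType` and `Count` columns as `x`, so it could be passed to `ax1.bar` in the same way.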

Horizontal Bar Chart of Locations

The second of these graphs is a horizontal bar chart that shows the most frequent locations where these calls originated. These places can be just about anywhere, from houses and apartments to the street to office buildings, and this chart shows the top eight. The “Other” category makes an appearance again, which is unusual because, as with the offense types, there are already many specific categories a call could be placed under. Residence was near the top as expected, considering Chicago has a high population and a considerable density of residences. An interesting bar to appear on the chart is the public building category. It likely includes public schools, municipal buildings, and public housing, but its count may be held down by the exclusion of private schools and office buildings, which are not public property.

x2 = df.groupby(['Location Description']).agg({'Location Description':['count']}).reset_index()
x2.columns = ['Location', 'Count']
x2 = x2.sort_values('Count', ascending=False)
x2.reset_index(inplace=True, drop=True)
from matplotlib.ticker import FuncFormatter

bottom2 = 0
top2 = 7
d2 = x2.loc[bottom2:top2]
fig=plt.figure(figsize=(18,10))
ax1 = fig.add_subplot(1, 1, 1)
ax1.barh(d2.Location, d2.Count, label='Report Count', color='mediumorchid')
ax1.legend(loc='upper right', fontsize=14)
ax1.spines['right'].set_visible(False)
ax1.spines['top'].set_visible(False)
ax1.set_xlabel('Reports (in millions)', fontsize=16)
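# The y-axis lists locations here, not offense types
ax1.set_ylabel('Location', fontsize=16)
# .loc[bottom2:top2] is inclusive, so d2 holds top2 + 1 rows; use len(d2) so the title matches the bars shown
ax1.set_title('Top ' + str(len(d2)) + ' Locations Police Called to', size=20)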
ax1.xaxis.set_major_formatter(FuncFormatter(lambda z, pos:('%1.1fM')%(z*1e-6)))
plt.show()

Line Chart of Reports by Hour

The third of these graphs is a line chart that shows how many reports were called in at each hour of the day, with one line per day of the week. Each point on a line is one hour in 24-hour time, so 0 is 12 AM and 13 is 1 PM, continuing through 23 (11 PM). One interesting observation is that Saturday and Sunday follow a different pattern than the weekdays, with more reports occurring early Saturday morning, around one to four in the morning, and from late Saturday night through early Sunday morning. This can likely be attributed to people being more active on weekend nights than on weekday nights, since many people have to work during the week and have more free time on the weekend. Another interesting feature is the spike on every line around noon, which may be related to the typical lunch rush for many people.

x3 = df.groupby(['Hour', 'Weekday']).agg({'Hour':['count']}).reset_index()
x3.columns = ['Hour', 'Weekday', 'Count']

fig=plt.figure(figsize=(18,10))
ax=fig.add_subplot(1, 1, 1)
my_colors = {'Mon':'maroon',
             'Tue':'mediumaquamarine',
             'Wed':'goldenrod',
             'Thu':'lightcoral',
             'Fri':'dodgerblue',
             'Sat':'darkmagenta',
             'Sun':'sienna'
            }
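# Group by a plain column name so each key is a string such as 'Mon'; a one-item list yields tuple keys in pandas 2.x
for key, grp in x3.groupby('Weekday'):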
    grp.plot(ax=ax, kind='line', x='Hour', y='Count', color=my_colors[key], label=key, marker='8')
plt.title('Total Reports by Hour', fontsize=18)
ax.set_xlabel('Hour (24 Hour Interval)', fontsize=18)
ax.set_ylabel('Total Reports', fontsize=18)
ax.tick_params(axis='x', labelsize=14, rotation=0)
ax.tick_params(axis='y', labelsize=14, rotation=0)
ax.set_xticks(np.arange(24))
handles, labels = ax.get_legend_handles_labels()
handles = [handles[1], handles[5], handles[6], handles[4], handles[0], handles[2], handles[3]]
labels = [labels[1], labels[5], labels[6], labels[4], labels[0], labels[2], labels[3]]
plt.legend(handles, labels, loc='best', fontsize=14, ncol=1)
plt.show()
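
Because each day of the week has a different total number of reports, comparing the shape of the weekend and weekday lines can be easier with shares than with raw counts. The following is a minimal sketch of that idea, assuming a frame laid out like `x3`; the `toy_x3` values are made up and `Share` is a hypothetical column name.

```python
import pandas as pd

# Hypothetical toy counts in the same shape as x3 (Hour, Weekday, Count)
toy_x3 = pd.DataFrame({
    'Hour':    [1, 12, 1, 12],
    'Weekday': ['Sat', 'Sat', 'Tue', 'Tue'],
    'Count':   [30, 70, 10, 90],
})

# Share of each weekday's reports that falls in each hour
day_totals = toy_x3.groupby('Weekday')['Count'].transform('sum')
toy_x3['Share'] = toy_x3['Count'] / day_totals

# Wide layout: one row per hour, one column per weekday
shares = toy_x3.pivot(index='Hour', columns='Weekday', values='Share')

print(shares)
```

In this toy data, a larger share of Saturday's reports falls at 1 AM than of Tuesday's, which is the kind of difference the chart description points to.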

Nested Pie Chart of Reports by Month & Quarter

This graph is a nested pie chart that shows what percentage of reports came in during each quarter and each month of the year. The outer ring shows the percentage and number of reports for each quarter, while the inner ring shows the percentage for each month in calendar order. One interesting feature is that the quarter with the most reports is the third quarter: July, August, and September. This quarter likely takes up the largest portion because most schools are out for the summer, meaning more young people are on the streets, whereas the winter months take up smaller portions since people tend to stay inside more during Chicago’s cold and windy winters.

df['Quarter'] = 'Quarter ' + df.Date.dt.quarter.astype('string')
x4 = df.groupby(['Quarter', 'MonthName', 'Month']).agg({'MonthName':['count']}).reset_index()
x4.sort_values(by=['Month'], inplace=True)
x4.reset_index(inplace=True, drop=True)
del x4['Month']
x4.columns = ['Quarter', 'Month', 'Reports']

outside_number_colors = len(x4.Quarter.unique())
outside_color_ref_number = np.arange(outside_number_colors)*4

number_inside_colors = len(x4.Month.unique())
all_color_ref_number = np.arange(outside_number_colors + number_inside_colors)

inside_color_reference_number = []
for each in all_color_ref_number:
    if each not in outside_color_ref_number:
        inside_color_reference_number.append(each)
print(outside_color_ref_number)
print(inside_color_reference_number)
fig = plt.figure(figsize=(7,7))
ax=fig.add_subplot(1, 1, 1)

colormap = plt.get_cmap("tab20c")
outer_colors = colormap(outside_color_ref_number)
inner_colors = colormap(inside_color_reference_number)

allreports = x4.Reports.sum()

x4.groupby(['Quarter'])['Reports'].sum().plot(
    kind='pie', radius=1, colors=outer_colors, pctdistance=0.85, labeldistance = 1.1,
    wedgeprops = dict(edgecolor='white'), textprops={'fontsize':13},
    autopct= lambda p: '{:.2f}%\n({:.1f}M)'.format(p,(p/100)*allreports/1e+6),
    startangle=90)

x4.Reports.plot(
    kind='pie', radius=0.7, colors=inner_colors, pctdistance=0.55, labeldistance = 0.8,
    wedgeprops = dict(edgecolor='white'), textprops={'fontsize':11}, labels=x4.Month, 
    autopct= '%1.2f%%', startangle=90)
hole=plt.Circle((0,0), 0.3, fc='white')
fig1=plt.gcf()
fig1.gca().add_artist(hole)

ax.yaxis.set_visible(False)
ax.text(0,0, 'Total Reports: \n' + str(round(allreports/1e+6, 2)) + 'M', ha='center', va='center', size=12)
plt.title('Total Reports by Quarter & Month', fontsize=18)
ax.axis('equal')
plt.tight_layout()

plt.show()
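
The color bookkeeping in the cell above can be checked on its own. The sketch below reproduces the two index lists with `np.setdiff1d` in place of the loop; `n_quarters`, `n_months`, `outer_idx`, and `inner_idx` are hypothetical names. It relies on the `tab20c` colormap being arranged as five hue families of four shades each, so every fourth index starts a new hue.

```python
import numpy as np

n_quarters = 4
n_months = 12

# Every 4th index is the darkest shade of a hue family: one per quarter
outer_idx = np.arange(n_quarters) * 4

# The remaining indices among the first 16 are the lighter shades: three per quarter
inner_idx = np.setdiff1d(np.arange(n_quarters + n_months), outer_idx)

print(outer_idx)
print(inner_idx)
```

Since the months are sorted in calendar order before plotting, each group of three inner indices lines up with the quarter whose hue it shares.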

Heatmap of Reports by Month & Year

The last of these graphs is a heatmap showing the number of crimes reported in each month of each year from 2011 onward. One interesting observation is that the larger numbers are concentrated in the earlier part of the decade, particularly in July and August of both 2011 and 2012. One possible explanation is that better social programs implemented over the past decade lowered the number of people who would be either first-time or repeat offenders. Unfortunately, the dataset ends after November of 2020, so the final year is incomplete and the last cell of the heatmap is left blank rather than showing a zero.

x5 = df.groupby(['Year', 'Month']).agg({'Month':['count']}).reset_index()
x5.columns = ['Year', 'Month', 'Reports']
# Keep only 2011 and later
x5 = x5.loc[x5['Year'] >= 2011]
x5 = x5.reset_index(drop=True)
x5['ReportsHundreds'] = round(x5['Reports']/100, 0)
x5_df = pd.pivot_table(x5, index='Year', columns='Month', values='Reports')

import seaborn as sns
from matplotlib.ticker import FuncFormatter

fig = plt.figure(figsize=(12,8))
ax=fig.add_subplot(1, 1, 1)

comma_fmt = FuncFormatter(lambda z, p: format(int(z), ','))

ax=sns.heatmap(x5_df, linewidth = 0.2, annot=True, cmap='coolwarm', 
              fmt=',.0f', square=True, annot_kws={'size':11},
              cbar_kws={'format':comma_fmt, 'orientation':'vertical'})
plt.title('Heatmap of Number of Reports by Month & Year', fontsize=18, pad=15)
plt.xlabel('Month', fontsize=18, labelpad=10)
plt.ylabel('Year', fontsize=18, labelpad=10)
plt.yticks(rotation=0, size=14)
plt.xticks(size=14)
ax.invert_yaxis()

cbar = ax.collections[0].colorbar
cbar.set_label('Number of Reports', rotation = 270, fontsize=14, labelpad=15)

plt.show()
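
The blank cell for the incomplete final year comes from the pivot step: a year and month combination with no rows becomes NaN, and seaborn leaves NaN cells unfilled. The sketch below shows this on made-up data (`toy_x5` is hypothetical) and how `reindex` could be used to guarantee that every expected month column exists.

```python
import numpy as np
import pandas as pd

# Hypothetical toy counts: the final year stops after month 2
toy_x5 = pd.DataFrame({
    'Year':    [2019, 2019, 2019, 2020, 2020],
    'Month':   [1, 2, 3, 1, 2],
    'Reports': [100, 120, 110, 90, 95],
})

# Month 3 of 2020 has no rows, so the pivot stores NaN there, not 0
grid = toy_x5.pivot_table(index='Year', columns='Month', values='Reports')
print(grid)

# Optional: force a fixed set of month columns (months 1 to 4 in this toy case)
grid_full = grid.reindex(columns=range(1, 5))
print(grid_full)
```

Filling the gap with zero instead would be misleading here, since a missing month means no data was collected, not that no crimes were reported.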

Sources:
https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2/data
https://data.cityofchicago.org/Public-Safety/Crimes-2018/3i3m-jwuy