Digging into the released Seattle and San Francisco datasets

We have been provided with datasets from the cities of Seattle and San Francisco reporting criminal incidents in summer 2014.
This paper analyzes those datasets, visualizing and comparing patterns across the two cities.
This analysis allowed us to infer that:

We will go through all those statements in detail in this paper.

If you want to see the report with code, it is all available at this link: report with code, or at the link under each figure.

Datasets overview

First, we load the datasets, recorded in a .csv format, and see what we are provided with.
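
As a rough illustration of this loading step, here is a minimal Python sketch using only the standard library. The helper names `load_records` and `overview` and the sample column names are assumptions made for the example. They are not taken from the original report.

```python
import csv
import io


def load_records(handle):
    """Read a CSV file-like object into a list of dicts (one per row)."""
    return list(csv.DictReader(handle))


def overview(records):
    """Return the row count and the column names of a loaded dataset."""
    columns = list(records[0].keys()) if records else []
    return {"rows": len(records), "columns": columns}


# Tiny in-memory sample standing in for a real .csv file (hypothetical columns).
sample = io.StringIO(
    "Category,Date\n"
    "LARCENY/THEFT,06/01/2014\n"
    "ASSAULT,06/02/2014\n"
)
records = load_records(sample)
print(overview(records))
```

With a real file, the same function can be called on `open(path, newline="")`, where `path` points to one of the two .csv files.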

What we can observe from those values:

First, let’s transform those variables and create a single dataframe in order to compare values for the two cities.

In order to create this single dataframe, we have to convert the crime descriptions in each dataset so that both cities use the same categories.
To do so, we have collected the National Incident-Based Reporting System (NIBRS) data, which categorizes crimes into universal categories.
This dataset can be found here: http://data.denvergov.org/download/gis/crime/csv/offense_codes.csv
The NIBRS data contains 15 high-level categories of crimes and 294 sub-categories.
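
The conversion described above could look like the following sketch. The mapping entries, the `harmonize` helper and the "Unmapped" fallback are illustrative assumptions. The real mapping would be built from the offense codes file and the descriptions actually present in each dataset.

```python
# Illustrative mapping from city-specific descriptions to shared categories.
# These entries are assumptions for the example, not the report's mapping.
ILLUSTRATIVE_MAPPING = {
    "LARCENY/THEFT": "Larceny",
    "SHOPLIFTING": "Larceny",
    "DISTURBANCE": "Public Disorder",
    "WARRANTS": "All Other Crimes",
}


def harmonize(records, city, description_field, mapping):
    """Tag each record with its city and a shared crime category.

    Descriptions missing from the mapping are labelled "Unmapped" so that
    they can be reviewed instead of being silently miscategorized.
    """
    unified = []
    for row in records:
        raw = row[description_field].strip().upper()
        unified.append({"city": city, "category": mapping.get(raw, "Unmapped")})
    return unified


sf_rows = [{"Category": "larceny/theft"}, {"Category": "SOMETHING ELSE"}]
print(harmonize(sf_rows, "San Francisco", "Category", ILLUSTRATIVE_MAPPING))
```

Running the same function on each city with its own description column and then concatenating the two lists gives a single table with identical categories for both cities.
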
As we can observe from the Seattle and the San Francisco datasets:

Let’s make a plot of this whole new dataset.

report with code

This plot seems to tell us that the number of offenses is higher in Seattle than in San Francisco. Indeed, we could have noticed earlier that the Seattle dataset contains more records (32 779) than the San Francisco dataset (28 993). This difference might simply be due to more people living in Seattle than in San Francisco.
Let’s check.

So let’s plot the number of offenses over time again, this time taking the population into account by plotting the number of offenses per capita. Here is what we get.
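
The per capita normalization itself is a simple division. The sketch below shows one way to write it. The `per_capita` helper, the scale of 100 000 inhabitants, the daily counts and the population value are all placeholders chosen for the example, not figures from the datasets.

```python
def per_capita(daily_counts, population, per=100_000):
    """Convert raw daily counts into counts per `per` inhabitants."""
    if population <= 0:
        raise ValueError("population must be positive")
    return {day: count * per / population for day, count in daily_counts.items()}


# Placeholder values only: replace with real daily counts and an official
# population estimate for each city.
example_counts = {"2014-06-01": 50, "2014-06-02": 40}
example_population = 1_000_000
print(per_capita(example_counts, example_population))
```

Applying the same function to each city with its own population puts both curves on a comparable scale.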

report with code

Note that we have plotted a smooth curve representing the trend for each city.
This plot shows that more offenses occurred in summer 2014 in Seattle than in San Francisco, and that this is not due to demographics. Indeed, Seattle has fewer inhabitants than San Francisco, which widens the gap between the two cities in terms of number of offenses.
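
The text does not say which smoother produced the trend curve. As a simple stand-in, and not necessarily the method used for the figure, a centered moving average over the daily values gives a comparable smoothed series. The `moving_average` helper and its default window of 7 days are assumptions.

```python
def moving_average(values, window=7):
    """Centered moving average; the window shrinks near both ends.

    `window` is meant to be odd. An even value behaves like window + 1,
    because the same number of neighbours is taken on each side.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        chunk = values[lo:hi]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


print(moving_average([1, 2, 3, 4, 5], window=3))
```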

Let’s dig deeper.

Types of offenses

Now we are going to look at the different types of offenses, trying to understand which kinds happen the most. Let’s first have a look at the general distribution across both cities.

report with code

With this plot, we see that the most frequent category of offenses is “All Other Crimes”. That is not surprising, because this category groups together a lot of sub-categories. The second most frequent one is “Larceny”. More generally, we can print the top 5 most frequent offenses this way.
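
One possible way to obtain such a ranking is sketched here with `collections.Counter`. The `top_categories` helper and the record layout (a `category` key per record) are assumptions carried over for illustration, and the sample records are made up.

```python
from collections import Counter


def top_categories(records, n=5):
    """Return the n most frequent categories as (category, count) pairs."""
    return Counter(row["category"] for row in records).most_common(n)


# Made-up records, only to show the call.
sample_records = (
    [{"category": "All Other Crimes"}] * 3
    + [{"category": "Larceny"}] * 2
    + [{"category": "Burglary"}]
)
print(top_categories(sample_records, n=2))
```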

Are those categories the same in both cities?

report with code

Here we clearly see that, apart from the category “All Other Crimes”, which is frequent in both cities, the trends depend on the city.
In Seattle, “Larceny” occurs far less often than in San Francisco. Conversely, “Public Disorder” occurs far more often in Seattle, while in San Francisco it stays relatively marginal.
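
A comparison like this one is easier to read with shares than with raw counts, since the two datasets differ in size. The sketch below computes, for each city, the fraction of its records falling in each category. The `category_shares` helper, the record layout and the sample values are assumptions for illustration.

```python
from collections import Counter, defaultdict


def category_shares(records):
    """Return {city: {category: share of that city's records}}."""
    per_city = defaultdict(Counter)
    for row in records:
        per_city[row["city"]][row["category"]] += 1
    shares = {}
    for city, counts in per_city.items():
        total = sum(counts.values())
        shares[city] = {cat: n / total for cat, n in counts.items()}
    return shares


# Made-up records, only to show the shape of the result.
sample_records = (
    [{"city": "Seattle", "category": "Public Disorder"}] * 3
    + [{"city": "Seattle", "category": "Larceny"}]
    + [{"city": "San Francisco", "category": "Larceny"}] * 2
)
print(category_shares(sample_records))
```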

Time evolution of offenses

We can go further and look at those types of offenses month by month.
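
A month-by-month breakdown amounts to grouping records by city, category and month. Here is a minimal sketch. The `monthly_counts` helper, the `date` key and the `%Y-%m-%d` date format are assumptions, and the real datasets may store dates differently.

```python
from collections import Counter
from datetime import datetime


def monthly_counts(records, date_format="%Y-%m-%d"):
    """Count records per (city, category, "YYYY-MM") triple."""
    counts = Counter()
    for row in records:
        month = datetime.strptime(row["date"], date_format).strftime("%Y-%m")
        counts[(row["city"], row["category"], month)] += 1
    return counts


# Made-up records, only to show the call.
sample_records = [
    {"city": "Seattle", "category": "Larceny", "date": "2014-06-03"},
    {"city": "Seattle", "category": "Larceny", "date": "2014-06-17"},
    {"city": "Seattle", "category": "Larceny", "date": "2014-07-01"},
]
for key, count in sorted(monthly_counts(sample_records).items()):
    print(key, count)
```

Filtering the result on a single category then gives one series per city, which is the form needed for a line plot.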

report with code

This plot gives us an overview of how offenses evolve month by month, but it would be much clearer to plot lines. We will focus here on “Larceny” and “Public Disorder”, as they are the categories that differ most between the two cities. Let’s try it.

report with code

In this plot, we can observe a specific trend, the same one in both cities:

There is certainly more to investigate here, which we would do if we had more time.