San Francisco Crime Classification

In this project we will visualize data from Kaggle’s San Francisco crime competition. While the main aim of the competition is to predict a crime category from data such as location and time of day, we will first make a map to get an initial peek at the data.

For this we will use the awesome leaflet package, which gives us a nice R interface to the popular Leaflet JavaScript library for building interactive maps.

library(leaflet)

And now, let’s load our data. Since we won’t be doing actual machine learning, we will just use the training data.

train <-  read.csv("../data/train.csv")
dim(train)

## [1] 878049      9

In order to save some computing time we will subset the data and use just 2000 points picked at random.

# sample(2000) would only shuffle the first 2000 rows; draw from all rows instead
subset_df <- train[sample(nrow(train), 2000), ]
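
Because the subset is random, the map will differ from run to run unless the random seed is fixed. The following is a minimal sketch of reproducible sampling. It uses a small made-up data frame (`toy`, with a hypothetical `id` column) in place of `train`, and the seed value is arbitrary. Note that `sample(n)` with a single argument returns a permutation of `1:n`, so the row count goes in as the first argument and the subset size as the second.

```r
# Hypothetical stand-in for `train`: 10,000 rows with made-up coordinates.
set.seed(42)  # fix the random seed so the same rows are drawn on every run
toy <- data.frame(
    id = 1:10000,
    X  = runif(10000, -122.52, -122.36),
    Y  = runif(10000, 37.70, 37.82)
)

# Draw 2000 distinct row indices from the full range of rows.
idx <- sample(nrow(toy), 2000)
toy_subset <- toy[idx, ]

dim(toy_subset)       # 2000 rows, 3 columns
range(toy_subset$id)  # ids come from the whole table, not only the first 2000
```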

And finally, let’s create the map and add some markers to it.

leaflet(data = subset_df) %>% 
    addTiles() %>% 
    addMarkers(~X, ~Y,
               clusterOptions = markerClusterOptions()
    )

Looking at the map, we can already see some patterns. Some districts have more recorded crimes than others: each cluster circle shows how many markers it contains and is colored accordingly, and you can click a cluster to zoom in for more detail. For example, the busy city center seems to have a higher concentration of crimes. An interesting exercise for the reader would be to visualize the different types of crime. In a future report we will try to predict crime categories!
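
One possible starting point for the exercise of visualizing crime types is to color the markers by category. The sketch below is an illustration only, not part of the original analysis. It uses a small made-up data frame rather than the Kaggle data, and it assumes the category column is called `Category`; the three category labels and the `"Set1"` palette are arbitrary choices.

```r
library(leaflet)

# Hypothetical stand-in for `subset_df`; the column name `Category` is an assumption.
set.seed(1)
toy <- data.frame(
    X = runif(300, -122.52, -122.36),
    Y = runif(300, 37.70, 37.82),
    Category = sample(c("ASSAULT", "LARCENY/THEFT", "VANDALISM"),
                      300, replace = TRUE),
    stringsAsFactors = FALSE
)

# Map each category to one color from a ColorBrewer palette.
pal <- colorFactor(palette = "Set1", domain = toy$Category)

m <- leaflet(data = toy) %>%
    addTiles() %>%
    addCircleMarkers(~X, ~Y,
                     color = ~pal(Category),  # one color per category
                     popup = ~Category,       # show the category on click
                     radius = 4, stroke = FALSE, fillOpacity = 0.7) %>%
    addLegend("bottomright", pal = pal, values = ~Category, title = "Category")

m  # print the widget to display the map
```

With the real data there are many more categories than a small qualitative palette can distinguish, so it may be worth filtering to a handful of categories before plotting.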

Boyan Angelov

7 Sep 2015