First, we examine the data! Do we have good dates and locations? Those are the two features our crime category prediction model will rely on, so let’s make sure we are working with good data.
The raw training dataset consists of 878,049 records and includes:

* Dates and day of the week
* Crime category, description and resolution
* Address, location coordinates and police district

Our first step in the analysis is to check for NAs and incomplete records, since they will not be useful for training our models. After removing NA records from the raw dataset, we find that there are actually zero incomplete records (a quick sketch of this check appears after the summary output below). Our next step is to check the data types and see whether any of them need to be changed. Below are the summary statistics for our raw dataset.
# set echo = TRUE so this code chunk is written to the page
summary(raw_data)
## Dates Category Descript
## Length:878049 Length:878049 Length:878049
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## DayOfWeek PdDistrict Resolution
## Length:878049 Length:878049 Length:878049
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## Address X Y
## Length:878049 Min. :-122.5 Min. :37.71
## Class :character 1st Qu.:-122.4 1st Qu.:37.75
## Mode :character Median :-122.4 Median :37.78
## Mean :-122.4 Mean :37.77
## 3rd Qu.:-122.4 3rd Qu.:37.78
## Max. :-120.5 Max. :90.00
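As a quick illustration of the completeness check described above, a sketch along these lines (assuming the training data frame is named `raw_data`) would confirm that there are no incomplete records:

```r
# Count rows containing any NA; complete.cases() flags fully populated rows
sum(!complete.cases(raw_data))   # expected to return 0 for this dataset

# Equivalently, count NAs per column
colSums(is.na(raw_data))
```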
Dates and times often present a challenge for data analysts because so many formats are possible. Looking at our raw data, we can see that the Dates column is in the format ymd hms, e.g. 2015-05-13 23:30:00. In our first pass at the dates, we’ll split the date and time components and store them in separate fields. While not the most elegant approach, we’ll simply split on the space, though we suspect a dedicated date-time parsing method would be more robust.
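A minimal sketch of that split-on-a-space approach is shown below; it assumes the training data frame is named `raw_data` and mentions `lubridate::ymd_hms()` only as the more robust parsing route we allude to above.

```r
library(lubridate)

# Split the "YYYY-MM-DD HH:MM:SS" strings on the space
parts    <- strsplit(raw_data$Dates, " ", fixed = TRUE)
date_str <- sapply(parts, `[`, 1)   # e.g. "2015-05-13"
time_str <- sapply(parts, `[`, 2)   # e.g. "23:30:00"

raw_data$Date <- as.POSIXct(date_str, tz = "UTC")    # date-only timestamp
raw_data$Time <- as.factor(time_str)                 # time of day as a factor
raw_data$Hour <- as.integer(substr(time_str, 1, 2))  # hour 0-23 for later grouping

# The more robust alternative would parse both parts in one step, e.g.
# lubridate::ymd_hms(raw_data$Dates)

summary(raw_data)
```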
## Dates Category Descript
## Length:878049 Length:878049 Length:878049
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## DayOfWeek PdDistrict Resolution
## Length:878049 Length:878049 Length:878049
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## Address X Y
## Length:878049 Min. :-122.5 Min. :37.71
## Class :character 1st Qu.:-122.4 1st Qu.:37.75
## Mode :character Median :-122.4 Median :37.78
## Mean :-122.4 Mean :37.77
## 3rd Qu.:-122.4 3rd Qu.:37.78
## Max. :-120.5 Max. :90.00
##
## Date Time Hour
## Min. :2003-01-06 00:00:00 12:00:00: 22351 Min. : 0.00
## 1st Qu.:2006-01-11 00:00:00 00:01:00: 21831 1st Qu.: 9.00
## Median :2009-03-07 00:00:00 18:00:00: 19330 Median :14.00
## Mean :2009-03-15 18:40:47 17:00:00: 16960 Mean :13.41
## 3rd Qu.:2012-06-11 00:00:00 20:00:00: 16294 3rd Qu.:19.00
## Max. :2015-05-13 00:00:00 19:00:00: 16277 Max. :23.00
## (Other) :765006
Now that we have the dates stored as a proper date type, we can see that they do indeed range from 2003-01-06 to 2015-05-13. Let’s take a look at the distribution of crime categories in our training data set.
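One way to get that distribution, sketched here under the assumption that dplyr and ggplot2 are available, is to count incidents per Category and plot a sorted bar chart:

```r
library(dplyr)
library(ggplot2)

category_counts <- raw_data %>%
  count(Category, sort = TRUE)   # incidents per crime category, largest first

ggplot(category_counts, aes(x = reorder(Category, n), y = n)) +
  geom_col() +
  coord_flip() +                 # horizontal bars keep the long category labels readable
  labs(x = "Crime category", y = "Number of incidents")
```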
We would also like to validate the location coordinates for these crimes to make sure that we only use valid San Francisco crime data in our prediction model. We’ll use ggmap to create a map, setting the zoom level far enough out to reveal any outliers we can detect visually. We’ll also calculate each crime’s distance from the city center as an additional check.
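A rough sketch of both checks follows; the city-center coordinates, the zoom level, and the use of `geosphere::distHaversine()` are our own assumptions for illustration, and `get_map()` may require an API key or tile-source configuration depending on the ggmap version.

```r
library(ggmap)
library(geosphere)

# Assumed city-center coordinates for San Francisco (illustrative only)
sf_center <- c(lon = -122.4194, lat = 37.7749)

# Distance of every incident from the city center, in kilometres
raw_data$DistCenterKm <- distHaversine(cbind(raw_data$X, raw_data$Y), sf_center) / 1000
summary(raw_data$DistCenterKm)   # unusually large values flag coordinates outside the city

# Base map zoomed out enough that stray points stand out visually
sf_map <- get_map(location = sf_center, zoom = 10)
ggmap(sf_map) +
  geom_point(data = raw_data, aes(x = X, y = Y),
             alpha = 0.05, size = 0.5, colour = "red")
```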