Introduction

First, we examine the data. Do we have good dates and locations? These are the two inputs to our crime category prediction model, so let’s make sure they are reliable.

Data Overview

The raw training dataset consists of 878,049 records and includes:

* Dates and day of the week
* Crime category, description and resolution
* Address, location coordinates and police district

The first step in our analysis is to check for NAs and incomplete records, since they are not useful for training our models. Filtering out NA records leaves the raw dataset unchanged: there are 0 incomplete records. Our next step is to check the data types and see whether any of them need to change. Below are the summary statistics for our raw dataset.

# set echo to True so this code chunk is written to page
summary(raw_data)
##     Dates             Category           Descript        
##  Length:878049      Length:878049      Length:878049     
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##   DayOfWeek          PdDistrict         Resolution       
##  Length:878049      Length:878049      Length:878049     
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##    Address                X                Y        
##  Length:878049      Min.   :-122.5   Min.   :37.71  
##  Class :character   1st Qu.:-122.4   1st Qu.:37.75  
##  Mode  :character   Median :-122.4   Median :37.78  
##                     Mean   :-122.4   Mean   :37.77  
##                     3rd Qu.:-122.4   3rd Qu.:37.78  
##                     Max.   :-120.5   Max.   :90.00
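
The NA check described above could be sketched as follows. This is a minimal illustration on a small made-up data frame, not the code used to produce this report; the name `toy` and its values are assumptions, and only a few of the columns from `raw_data` are mimicked.

```r
# Hypothetical stand-in for raw_data (the real dataset has 878,049 rows).
toy <- data.frame(
  Dates    = c("2015-05-13 23:30:00", "2015-05-13 23:53:00", NA),
  Category = c("WARRANTS", "OTHER OFFENSES", "LARCENY/THEFT"),
  X        = c(-122.4259, NA, -122.4244),
  Y        = c(37.77460, 37.77460, 37.80041),
  stringsAsFactors = FALSE
)

# Number of missing values in each column.
na_per_column <- colSums(is.na(toy))

# TRUE for rows that have no missing value in any column.
complete_rows <- complete.cases(toy)

# Count the incomplete rows, then keep only the complete ones.
n_incomplete <- sum(!complete_rows)
toy_clean    <- toy[complete_rows, ]

print(na_per_column)
print(n_incomplete)
```

Note that `is.na()` does not flag empty strings, so character columns may still hold blank values that a check like this would miss.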

Dates

Dates and times often present a challenge for data analysts because so many formats are possible. Looking at our raw data, we can see that the Dates field is in the format Ymd HMS (year-month-day hour:minute:second), like 2015-05-13 23:30:00. In our first pass, we’ll split the date and time components and store them in separate fields. While not the most elegant approach, we’ll just split on the space, though a dedicated date-time parsing method would likely be more robust.

##     Dates             Category           Descript        
##  Length:878049      Length:878049      Length:878049     
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##   DayOfWeek          PdDistrict         Resolution       
##  Length:878049      Length:878049      Length:878049     
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##    Address                X                Y        
##  Length:878049      Min.   :-122.5   Min.   :37.71  
##  Class :character   1st Qu.:-122.4   1st Qu.:37.75  
##  Mode  :character   Median :-122.4   Median :37.78  
##                     Mean   :-122.4   Mean   :37.77  
##                     3rd Qu.:-122.4   3rd Qu.:37.78  
##                     Max.   :-120.5   Max.   :90.00  
##                                                     
##       Date                           Time             Hour      
##  Min.   :2003-01-06 00:00:00   12:00:00: 22351   Min.   : 0.00  
##  1st Qu.:2006-01-11 00:00:00   00:01:00: 21831   1st Qu.: 9.00  
##  Median :2009-03-07 00:00:00   18:00:00: 19330   Median :14.00  
##  Mean   :2009-03-15 18:40:47   17:00:00: 16960   Mean   :13.41  
##  3rd Qu.:2012-06-11 00:00:00   20:00:00: 16294   3rd Qu.:19.00  
##  Max.   :2015-05-13 00:00:00   19:00:00: 16277   Max.   :23.00  
##                                (Other) :765006
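
For comparison, here is a hedged sketch of both approaches mentioned above: splitting on the space, and parsing the full timestamp with base R. The vector `dates_raw` and the derived variable names are hypothetical. The `"UTC"` time zone is an assumption made only to avoid daylight-saving gaps while parsing; the source does not state which zone the timestamps use.

```r
# Hypothetical sample in the same format as the Dates column.
dates_raw <- c("2015-05-13 23:30:00", "2003-01-06 00:01:00")

# Approach 1: split each string on the single space.
parts    <- strsplit(dates_raw, " ", fixed = TRUE)
date_str <- vapply(parts, `[`, character(1), 1)  # first piece: the date
time_str <- vapply(parts, `[`, character(1), 2)  # second piece: the time

# Approach 2: parse the whole timestamp, then derive the components.
# tz = "UTC" is an assumption, not something the dataset specifies.
parsed    <- as.POSIXct(dates_raw, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
date_part <- as.Date(parsed, tz = "UTC")
time_part <- format(parsed, "%H:%M:%S")
hour_part <- as.integer(format(parsed, "%H"))

print(data.frame(date_part, time_part, hour_part))
```

One advantage of the parsing approach is that a malformed string becomes `NA` rather than silently producing a bad date or time piece, so `sum(is.na(parsed))` gives a quick validity check.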

Now that the dates are stored as a date type, we can see that they do indeed range from 2003-01-06 to 2015-05-13. Let’s take a look at the distribution of crime categories in our training dataset.
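
One simple way to tabulate such a distribution is sketched below. The `category` vector holds made-up illustrative values rather than counts from the real data; in practice it would be the Category column of the training set.

```r
# Hypothetical sample of category labels (illustrative values only).
category <- c("LARCENY/THEFT", "WARRANTS", "LARCENY/THEFT",
              "OTHER OFFENSES", "LARCENY/THEFT", "WARRANTS")

# Frequency of each category, most common first.
counts <- sort(table(category), decreasing = TRUE)

# The same distribution expressed as percentages of all records.
shares <- round(100 * prop.table(counts), 1)

print(counts)
print(shares)
```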

Location

We would also like to validate the location coordinates for these crimes to make sure that we only use valid San Francisco crime data in our prediction model. The summary statistics above already show a maximum Y of 90.00 and a maximum X of -120.5, which suggests that some records fall outside the city. We’ll use ggmap to create a map and set the zoom level far enough out to reveal any visible outliers. We should also calculate each crime’s distance from the city center as a second check.
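
The distance check could be sketched in base R with the haversine formula, as below. This is an illustration under stated assumptions, not the report's own method: the city-center coordinates (roughly 37.7749, -122.4194), the 20 km cutoff, and the names `haversine_km`, `pts` and `max_km` are all hypothetical choices. The sketch does not use ggmap.

```r
# Great-circle distance in kilometers between two lon/lat points (degrees).
haversine_km <- function(lon1, lat1, lon2, lat2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  # pmin guards against rounding pushing sqrt(a) slightly above 1.
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

# Assumed city center and cutoff; both are illustrative, not from the data.
center_lon <- -122.4194
center_lat <- 37.7749
max_km     <- 20

# Two plausible points plus the extreme X/Y values seen in the summary.
pts <- data.frame(
  X = c(-122.4259, -122.5, -120.5),
  Y = c(37.7746, 37.71, 90.00)
)

pts$dist_km <- haversine_km(center_lon, center_lat, pts$X, pts$Y)
pts$in_city <- pts$dist_km <= max_km

print(pts)
```

A fixed radius is a coarse filter; a bounding box or the city boundary polygon would be stricter, at the cost of extra setup.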