Introduction

This document shows the steps taken to get the number of crimes per zip code for 2014 in Dallas County.

This analysis uses Dallas Open Data (https://www.dallasopendata.com/) to calculate the number of crimes per zip code in Dallas County. As the Dallas Police Public Data website (http://www.dallaspolice.net/publicdata/) points out, the data that the police supply to the public is sample data, so it cannot be used to produce official statistics.

To get a list of Dallas County zip codes, I used http://www.unitedstateszipcodes.org/ and kept only the zip codes needed for this example analysis.

Analysis

Requirements

Load the required packages.

library(lubridate)
library(dplyr)
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:lubridate':
## 
##     intersect, setdiff, union
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Gather and Combine Data

Get the crime data from dallasopendata.com

crime.data.file <- "crime.csv"
if(!file.exists(crime.data.file)){
  download.file("http://www.dallasopendata.com/api/views/ftja-9jxd/rows.csv?accessType=DOWNLOAD",
                destfile=crime.data.file)
}

Disclaimer: The data supplied by the Dallas Police Department is sampled and should not be used for statistical purposes, but we should be able to extract enough information to get a general idea of where crime is concentrated.

The Dallas Police Department implemented a new Records Management System (RMS) on June 1, 2014. To get crime data for 2014, two datasets are needed.

rms.file <- "rms.csv"
if(!file.exists(rms.file)){
  download.file("http://www.dallasopendata.com/api/views/tbnj-w5hb/rows.csv?accessType=DOWNLOAD",
                destfile=rms.file)
}
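
The same download-if-missing pattern is used for both files, so it could be wrapped in a small helper. This is only a sketch: the function name `download_if_missing` is hypothetical and does not appear in the original analysis.

```r
# Hypothetical helper: download a file only when no local copy exists yet.
download_if_missing <- function(url, destfile) {
  if (!file.exists(destfile)) {
    download.file(url, destfile = destfile)
  }
  # Return the local path invisibly so the call can be chained or assigned.
  invisible(destfile)
}
```

With this helper, each download above becomes a single call, for example `download_if_missing(url, "rms.csv")` with `url` set to the address used above.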

Read both parts of the crime data into data frames.

crime.data.part1 <- read.csv(crime.data.file,
                             as.is = TRUE)
crime.data.part2 <- read.csv(rms.file,
                             as.is = TRUE)

Get the columns that are needed for this analysis.

crime.data <- dplyr::select(crime.data.part1, offensedate, offensetimedispatched, offensezip)
temp <- dplyr::select(crime.data.part2, Date1, Time1, ZipCode)

Change the column names to match the first set of data.

colnames(temp) <- c("offensedate", "offensetimedispatched", "offensezip")

From the RMS data, keep only records dated on or after June 1, 2014 and within the year 2014. Records with a missing or unparseable date are dropped.

temp <- mutate(temp, tempdate = as.Date(offensedate,
                                        format = "%m/%d/%Y"))

# Keep rows with a valid date from June 1, 2014 through the end of 2014.
temp <- temp[!is.na(temp$tempdate)
             & temp$tempdate >= as.Date("2014-06-01")
             & year(temp$tempdate) == 2014, ]
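
To make the date filter easier to check, here is a minimal base R sketch on a made-up data frame. The data frame `toy` and its values are invented for illustration and are not taken from the RMS file.

```r
# Toy stand-in for the RMS extract; the values are made up for illustration.
toy <- data.frame(
  offensedate = c("05/31/2014", "06/01/2014", "12/31/2014",
                  "01/01/2015", "not a date"),
  stringsAsFactors = FALSE
)

# as.Date() returns NA for strings that do not match the given format.
toy$tempdate <- as.Date(toy$offensedate, format = "%m/%d/%Y")

# Keep rows dated June 1 through December 31, 2014; drop unparseable dates.
keep <- !is.na(toy$tempdate) &
  toy$tempdate >= as.Date("2014-06-01") &
  toy$tempdate <= as.Date("2014-12-31")
kept <- toy[keep, , drop = FALSE]
```

Only the June 1 and December 31 rows remain in `kept`; the May row, the 2015 row, and the unparseable string are removed.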

Remove the tempdate column

tempdateindex <- grep("^tempdate$", colnames(temp))
temp <- temp[,-tempdateindex]

Bind the two data sets

crime.data <- rbind(crime.data, temp)
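
The earlier renaming step matters because `rbind()` on data frames matches columns by name and stops with an error when the names differ. A small sketch with invented values (the data frames `a` and `b` are illustrative only):

```r
# Two toy data frames with the same content type but different column names.
a <- data.frame(offensedate = "01/02/2014", offensezip = 75217,
                stringsAsFactors = FALSE)
b <- data.frame(Date1 = "07/04/2014", ZipCode = 75243,
                stringsAsFactors = FALSE)

# Binding without renaming fails because the column names do not match.
failed <- inherits(try(rbind(a, b), silent = TRUE), "try-error")

# After renaming, the rows stack as expected.
names(b) <- names(a)
combined <- rbind(a, b)
```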

Analyze the Data

Check the date range of the data.

crime.data$offensedate <- as.Date(crime.data$offensedate,
    format="%m/%d/%Y")

paste("Min is ", min(crime.data$offensedate), sep=" ")
## [1] "Min is  1994-03-15"
paste("Max is ", max(crime.data$offensedate), sep=" ")
## [1] "Max is  2014-12-31"

Check if the data is what is expected

crime.data <- mutate(crime.data, offenseyear = year(crime.data$offensedate))

crime <- group_by(crime.data, offenseyear)
summarize(crime, countsperyear = length(offenseyear))
## Source: local data frame [20 x 2]
## 
##    offenseyear countsperyear
## 1         1994             1
## 2         1995             1
## 3         1996             1
## 4         1997             1
## 5         1998             1
## 6         2000             2
## 7         2001             4
## 8         2002             4
## 9         2003            12
## 10        2004            21
## 11        2005            14
## 12        2006            14
## 13        2007            23
## 14        2008            33
## 15        2009            31
## 16        2010            60
## 17        2011            80
## 18        2012           270
## 19        2013         24124
## 20        2014        108900

Nearly all observations fall in 2014 (108,900 records), with a further 24,124 in 2013 and only a handful in each earlier year, so the data appears to be incomplete for years before 2014.

Get data for 2014

crime.data <- crime.data[crime.data$offenseyear == 2014,]

Check the distribution per month

crime.data <- mutate(crime.data, offensemonth = month(crime.data$offensedate))
crime.data.month <- group_by(crime.data, offensemonth)
summarize(crime.data.month, countspermonth = length(offensemonth))
## Source: local data frame [12 x 2]
## 
##    offensemonth countspermonth
## 1             1          10739
## 2             2           9871
## 3             3          11460
## 4             4          12074
## 5             5          12177
## 6             6           6971
## 7             7           7780
## 8             8           7511
## 9             9           7288
## 10           10           8040
## 11           11           7365
## 12           12           7624
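
The monthly counts printed above drop after May, which lines up with the June 1, 2014 switch to the new RMS described earlier. The sketch below only re-enters the printed counts and compares the average for the two periods; it says nothing about why the counts differ, which the source does not explain.

```r
# Monthly counts copied from the output above (January through December).
counts <- c(10739, 9871, 11460, 12074, 12177,
            6971, 7780, 7511, 7288, 8040, 7365, 7624)

# Average monthly count before and after the June 1, 2014 system change.
mean.before <- mean(counts[1:5])   # January to May
mean.after  <- mean(counts[6:12])  # June to December

# The monthly counts should add up to the 2014 total reported earlier.
total.2014 <- sum(counts)
```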

For 2014, what are the zip codes with the most crimes?

I got the Dallas zip codes and their latitude and longitude values from http://www.unitedstateszipcodes.org/.

dallas.zips <- read.csv("zip_code_database.csv")
zip.index <- grep("^offensezip$", colnames(crime.data))
colnames(crime.data)[zip.index] <- "zip"
merged.data <- merge(crime.data, dallas.zips, by = "zip")

crime.data.zip <- group_by(merged.data, zip)
summary <- summarise(crime.data.zip,
                     countsperzip = length(zip))

head(arrange(summary, desc(countsperzip)))
## Source: local data frame [6 x 2]
## 
##     zip countsperzip
## 1 75217         6990
## 2 75243         5316
## 3 75216         5297
## 4 75228         5069
## 5 75211         4997
## 6 75220         4586
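
One thing to keep in mind with the `merge()` step: by default it performs an inner join, so crime records whose zip code is missing or not in the zip code file are dropped before counting. The base R sketch below uses invented data to show how to check how many rows are lost; the `primary_city` column name is an assumption about the zip code file, not something confirmed by the analysis above.

```r
# Toy crime records: two valid zips, one unknown zip, one missing zip.
crimes <- data.frame(zip = c(75217, 75217, 75243, 99999, NA))

# Toy zip lookup; primary_city is an assumed column name.
zips <- data.frame(zip = c(75217, 75243, 75216), primary_city = "Dallas")

# Default merge() keeps only rows whose zip appears in both data frames.
merged <- merge(crimes, zips, by = "zip")
dropped <- nrow(crimes) - nrow(merged)

# Count records per zip and sort from most to fewest.
counts <- as.data.frame(table(zip = merged$zip),
                        responseName = "countsperzip")
counts <- counts[order(-counts$countsperzip), ]
```

Comparing `nrow(crime.data)` with `nrow(merged.data)` in the real analysis would show how many 2014 records were excluded from the per-zip totals.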