This document shows the steps taken to get the number of crimes per zip code for 2014 in Dallas County.
This analysis uses Dallas Open Data (https://www.dallasopendata.com/) to calculate the number of crimes per zip code in Dallas County. As pointed out on the Dallas Police Public Data website (http://www.dallaspolice.net/publicdata/), the data that the police supply to the public is sample data, so it cannot be used to produce official statistics.
To get a list of Dallas County zip codes, I used http://www.unitedstateszipcodes.org/ and kept only the zip codes needed for this example analysis.
Load the required packages.
library(lubridate)
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:lubridate':
##
## intersect, setdiff, union
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Get the crime data from dallasopendata.com
crime.data.file <- "crime.csv"
if (!file.exists(crime.data.file)) {
  download.file("http://www.dallasopendata.com/api/views/ftja-9jxd/rows.csv?accessType=DOWNLOAD",
                destfile = crime.data.file)
}
Disclaimer: The data supplied by the Dallas Police Department is sampled and should not be used for statistical purposes, but we should be able to extract enough information to get a general idea of where crime is concentrated.
The Dallas Police Department implemented a new Records Management System (RMS) on June 1, 2014. To get crime data for 2014, two datasets are needed.
rms.file <- "rms.csv"
if (!file.exists(rms.file)) {
  download.file("http://www.dallasopendata.com/api/views/tbnj-w5hb/rows.csv?accessType=DOWNLOAD",
                destfile = rms.file)
}
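Both downloads follow the same check-then-download pattern, so it could be wrapped in a small helper. The following is a minimal sketch: `download_if_missing` is a hypothetical name that is not part of the original analysis, and the demonstration only exercises it on a temporary file that already exists, so nothing is actually downloaded.

```r
# Hypothetical helper (not in the original analysis): download a file only
# when no local copy exists yet, and return the local path invisibly.
download_if_missing <- function(url, destfile) {
  if (!file.exists(destfile)) {
    download.file(url, destfile = destfile)
  }
  invisible(destfile)
}

# Demonstration with a file that already exists, so no download is attempted.
# The URL is a placeholder and is never contacted here.
existing.file <- tempfile(fileext = ".csv")
writeLines("offensedate,offensezip", existing.file)
download_if_missing("http://example.invalid/rows.csv", existing.file)

# The existing file is left untouched.
readLines(existing.file)
```

With such a helper, each of the two download steps would reduce to a single call with the dataset URL and the local file name.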
Read both parts of the crime data into data frames.
crime.data.part1 <- read.csv(crime.data.file,
                             as.is = TRUE)
crime.data.part2 <- read.csv(rms.file,
                             as.is = TRUE)
Get the columns that are needed for this analysis.
crime.data <- dplyr::select(crime.data.part1, offensedate, offensetimedispatched, offensezip)
temp <- dplyr::select(crime.data.part2, Date1, Time1, ZipCode)
Change names of the columns to match the first set of data
colnames(temp) <- c("offensedate", "offensetimedispatched", "offensezip")
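As an alternative sketch, `dplyr::select` can rename columns while selecting them, using the form `new_name = old_name`. The toy data frame below is made up for illustration and only mimics the RMS column names used above; the `Other` column is hypothetical.

```r
library(dplyr)

# Made-up rows that mimic the RMS column names used above.
rms.toy <- data.frame(
  Date1   = c("06/01/2014", "07/15/2014"),
  Time1   = c("10:30", "23:05"),
  ZipCode = c("75217", "75243"),
  Other   = c("a", "b"),          # hypothetical extra column, dropped below
  stringsAsFactors = FALSE
)

# Select and rename in one step: new_name = old_name.
temp.toy <- dplyr::select(rms.toy,
                          offensedate = Date1,
                          offensetimedispatched = Time1,
                          offensezip = ZipCode)
colnames(temp.toy)
```

This keeps the selection and the renaming in one place, so the column order and the new names cannot drift apart.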
From the RMS data, keep only records dated on or after June 1, 2014 and within the year 2014. Records whose dates cannot be parsed are dropped as well.
temp <- mutate(temp, tempdate = as.Date(temp$offensedate,
                                        format = "%m/%d/%Y"))
temp <- temp[as.Date(temp$tempdate) >= as.Date("2014-06-01")
             & year(temp$tempdate) == 2014
             & !is.na(temp$tempdate), ]
Remove the tempdate column
tempdateindex <- grep("^tempdate$", colnames(temp))
temp <- temp[,-tempdateindex]
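The same steps can be sketched with `dplyr::filter`, which keeps only rows where the condition is TRUE, so dates that fail to parse (and become NA) are removed without an explicit `is.na` check. The rows below are made up for illustration and are not taken from the RMS data.

```r
library(dplyr)
library(lubridate)

# Made-up rows: one before the RMS cutover, two valid 2014 dates,
# one in 2015, and one that cannot be parsed.
temp.toy <- data.frame(
  offensedate = c("05/31/2014", "06/01/2014", "12/31/2014",
                  "01/01/2015", "not a date"),
  offensezip  = c("75217", "75243", "75216", "75228", "75211"),
  stringsAsFactors = FALSE
)

kept <- temp.toy %>%
  # Strings that do not match the format become NA.
  mutate(tempdate = as.Date(offensedate, format = "%m/%d/%Y")) %>%
  # Rows where either condition is FALSE or NA are dropped.
  filter(tempdate >= as.Date("2014-06-01"),
         year(tempdate) == 2014) %>%
  # Remove the helper column again.
  select(-tempdate)

kept$offensedate
## [1] "06/01/2014" "12/31/2014"
```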
Bind the two data sets
crime.data <- rbind(crime.data, temp)
Check the date range of the combined data
crime.data$offensedate <- as.Date(crime.data$offensedate,
format="%m/%d/%Y")
paste("Min is", min(crime.data$offensedate))
## [1] "Min is 1994-03-15"
paste("Max is", max(crime.data$offensedate))
## [1] "Max is 2014-12-31"
Check whether the data looks as expected by counting the records per year
crime.data <- mutate(crime.data, offenseyear = year(crime.data$offensedate))
crime <- group_by(crime.data, offenseyear)
summarize(crime, countsperyear = length(offenseyear))
## Source: local data frame [20 x 2]
##
## offenseyear countsperyear
## 1 1994 1
## 2 1995 1
## 3 1996 1
## 4 1997 1
## 5 1998 1
## 6 2000 2
## 7 2001 4
## 8 2002 4
## 9 2003 12
## 10 2004 21
## 11 2005 14
## 12 2006 14
## 13 2007 23
## 14 2008 33
## 15 2009 31
## 16 2010 60
## 17 2011 80
## 18 2012 270
## 19 2013 24124
## 20 2014 108900
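Almost all observations fall in 2014 (108,900 records) or 2013 (24,124 records), and every earlier year has 270 records or fewer, so complete data does not appear to be available for years other than 2014.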
Keep only the data for 2014
crime.data <- crime.data[crime.data$offenseyear == 2014,]
Check the distribution per month
crime.data <- mutate(crime.data, offensemonth = month(crime.data$offensedate))
crime.data.month <- group_by(crime.data, offensemonth)
summarize(crime.data.month, countspermonth = length(offensemonth))
## Source: local data frame [12 x 2]
##
## offensemonth countspermonth
## 1 1 10739
## 2 2 9871
## 3 3 11460
## 4 4 12074
## 5 5 12177
## 6 6 6971
## 7 7 7780
## 8 8 7511
## 9 9 7288
## 10 10 8040
## 11 11 7365
## 12 12 7624
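As a quick sanity check on the table above, the monthly counts can be summed and compared with the 2014 row of the yearly summary, and the months before and after the June 1 switch to RMS can be compared. The numbers are copied from the output above; this is only arithmetic on those figures, not a re-run of the analysis.

```r
# Monthly counts for 2014, copied from the output above.
counts.per.month <- c(10739, 9871, 11460, 12074, 12177,
                      6971, 7780, 7511, 7288, 8040, 7365, 7624)

# The monthly counts should add up to the 2014 row of the yearly summary.
sum(counts.per.month)
## [1] 108900

# Average monthly count before (Jan-May) and after (Jun-Dec) the RMS switch.
mean.before <- mean(counts.per.month[1:5])
mean.after  <- mean(counts.per.month[6:12])
round(c(before = mean.before, after = mean.after))
## before  after
##  11264   7511
```

The monthly average drops by roughly a third from June onward, which coincides with the change of records system. The source does not explain the drop, so it may reflect a difference in how the two datasets are sampled or published rather than a change in crime levels.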
For 2014, what are the zip codes with the most crimes?
I got the Dallas zip codes, along with their latitudes and longitudes, from http://www.unitedstateszipcodes.org/
dallas.zips <- read.csv("zip_code_database.csv")
zip.index <- grep("^offensezip$", colnames(crime.data))
colnames(crime.data)[zip.index] <- "zip"
merged.data <- merge(crime.data, dallas.zips, by = "zip")
crime.data.zip <- group_by(merged.data, zip)
# Named zip.summary rather than summary to avoid masking base::summary()
zip.summary <- summarise(crime.data.zip,
                         countsperzip = length(zip))
head(arrange(zip.summary, desc(countsperzip)))
## Source: local data frame [6 x 2]
##
## zip countsperzip
## 1 75217 6990
## 2 75243 5316
## 3 75216 5297
## 4 75228 5069
## 5 75211 4997
## 6 75220 4586
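Note that `merge` performs an inner join by default, so records whose zip code is missing from the lookup table are dropped before counting. The following is a minimal sketch on made-up data: the rows and coordinates are illustrative only, and the `latitude` and `longitude` column names are assumptions about the lookup file.

```r
library(dplyr)

# Made-up crime records; "99999" stands in for a zip outside the lookup table.
crime.toy <- data.frame(
  zip = c("75217", "75217", "75243", "99999"),
  stringsAsFactors = FALSE
)

# Made-up lookup table; column names other than zip are assumptions.
zips.toy <- data.frame(
  zip       = c("75217", "75243"),
  latitude  = c(32.7, 32.9),
  longitude = c(-96.7, -96.7),
  stringsAsFactors = FALSE
)

# Inner join: the "99999" record has no match and is dropped.
merged.toy <- merge(crime.toy, zips.toy, by = "zip")
nrow(merged.toy)
## [1] 3

# count() groups by zip and tallies the rows; sort = TRUE orders by
# descending count, combining the group_by/summarise/arrange steps.
counts.toy <- dplyr::count(merged.toy, zip, sort = TRUE)
counts.toy
```

Comparing `nrow(crime.data)` with `nrow(merged.data)` in the real analysis would show how many 2014 records were excluded because their zip code was not in the lookup table.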