Many cities are now providing city data for public viewing online. Virginia Beach, VA is one of them.
As a data science enthusiast, I wanted to learn a little bit more about the data.
The Virginia Beach Police Calls for Service dataset looked like a good place to start.
Even though this web site provides tools for basic exploration of the data, I wanted to take it a step further. So let’s see what else we can learn from the data.
While reading articles on R-bloggers, I stumbled upon the term “Descriptive Analytics”. This lead me to a series of exercises by Vasileios Tsakalos. Descriptive Analytics is the examination of data or content, usually manually performed, to answer the question “What happened?” The goal is to inform the user about what is going on at the dataset. I liked how he structured his exercises and his analysis. I found it clear and easy to understand and based this analysis on it.
Let’s get to work.
Download the data in JSON format. For this example, I’m working with a local copy of the file.
This article on ZevRoss provides an overview on parsing JSON data from an open data portal.
On this site, the author used RJSONIO’s fromJSON to read the JSON file. I found the extraction of the data especially the location information rather complex. After some investigating, I opted to give rList’s list.load a try.
#vb_data <- list.load("https://data.vbgov.com/api/views/7gep-fpmg/rows.json?accessType=DOWNLOAD")
#vb_data <- fromJSON("./data/vb_police_calls.json", simplifyVector=FALSE)
vb_data_raw <- list.load("./data/vb_police_calls.json")
The JSON data file is composed of 2 sections: meta, data.
The meta section contains descriptive and administrative metadata about the JSON file while the data section contains the actual information to analyze.
Let’s learn a little bit about the JSON file.
info <- vb_data_raw[[names(vb_data_raw['meta'])]]
# JSON meta columns
names(info$view)
## [1] "id" "name"
## [3] "attribution" "averageRating"
## [5] "category" "createdAt"
## [7] "description" "displayType"
## [9] "downloadCount" "hideFromCatalog"
## [11] "hideFromDataJson" "indexUpdatedAt"
## [13] "licenseId" "newBackend"
## [15] "numberOfComments" "oid"
## [17] "provenance" "publicationAppendEnabled"
## [19] "publicationDate" "publicationGroup"
## [21] "publicationStage" "rowClass"
## [23] "rowsUpdatedAt" "rowsUpdatedBy"
## [25] "tableId" "totalTimesRated"
## [27] "viewCount" "viewLastModified"
## [29] "viewType" "columns"
## [31] "grants" "license"
## [33] "metadata" "owner"
## [35] "query" "rights"
## [37] "tableAuthor" "tags"
## [39] "flags"
# show description of JSON file
info$view$description
## [1] "This dataset includes information on calls for service to the Virginia Beach Police Department. Calls for service are requests from citizens for police assistance. Calls can range from minor problems in the neighborhood (traffic complaints, loud neighbors, and graffiti) to the most serious crimes (burglaries, robberies, and homicides). Call descriptions are based upon the information received at the time of dispatch."
# show license
info$view$license
## $name
## [1] "Public Domain"
# show type of data
info$view$category
## [1] "Public Safety"
# show JSON file date, e.g. indexUpdateAt
as.POSIXct(info$view$indexUpdatedAt, origin="1970-01-01")
## [1] "2017-10-14 08:39:35 EDT"
The relevant columns are as follows:
| Index | Column Name | Description |
|---|---|---|
| 9 | Incident Number | This number is generated by the computer-aided dispatch system (CAD) and provides a unique ID number for each call for service. |
| 10 | Report Number | The report number assigned to the call by the Virginia Beach Police Department. Only incidents that result in a police report will be assigned a report number. |
| 11 | Call Type | Provides the type of call as described by the individual making the call for service. Upon investigation by responding officers, the original call type may be amended and captured as the “Final Call Type.” |
| 12 | Zone | Reporting District ID used by the Virginia Beach Police Department. |
| 13 | Call Disposition | This describes how the call was resolved. A list of all case dispositions can be downloaded on the dataset information page. |
| 14 | Priority | There are five priority codes depending on the severity of the incident. For a description of all priority codes, download the ‘Priority Descriptions’ document on the dataset information page. |
| 15 | Subdivision | These are subdivisions or neighborhood areas that have been delineated by the Virginia Beach Police Department. To view the map, download the ‘Subdivision Map’ file on the dataset information page. |
| 16 | Call Date | The date and time the call was received. |
| 17 | Entry Date | The time the call was entered into the computer-aided dispatch (CAD) system. |
| 18 | Dispatch Date | The time the call was initially dispatched. |
| 19 | Enroute Date | Represents the first en route time captured. If multiple units are responding, this time represents the first unit classified as en route. |
| 20 | On Scene Date | The time the first unit arrives on scene. |
| 21 | Close Date | Time that the call is concluded. |
| 22 | Location Info | This includes approximate (down to the street block level) location information where the incident occurred. |
The rList’s list.select provides a straightforward method to extract the crime data. It also performs the action as a single operation unlike in the ZevRoss example.
# extract the relevant columns into a data frame plus lat/lng from location info
crimesDF <- vb_data_raw[['data']] %>>%
list.select(.[9]
, .[10]
, .[11]
, .[12]
, .[13]
, .[14]
, .[15]
, .[16]
, .[17]
, .[18]
, .[19]
, .[20]
, .[21]
, (.[22] %>>% list.map(.[2]))
, (.[22] %>>% list.map(.[3]))) %>>%
list.stack()
# set the column names
colnames(crimesDF) <- c("incident_num"
, "report_num"
, "call_type"
, "zone"
, "call_disposition"
, "priority"
, "subdivision"
, "call_date"
, "entry_date"
, "dispatch_date"
, "enroute_date"
, "onscene_date"
, "close_date"
, "lat"
, "lng")
In statistics, imputation is the process of replacing missing data with substituted values.
Clean up NULL values. Ensure numeric fields are numeric. Ensure data fields are valid and date fields consistent.
# init NULL report numbers to -1
crimesDF$report_num[crimesDF$report_num == "NULL"] <- -1
# init NULL call dispositions to <UNDEF>
crimesDF$call_disposition[crimesDF$call_disposition == "NULL"] <- "<UNDEF>"
# init NULL call types to -1
crimesDF$call_type[crimesDF$call_type == "NULL"] <- -1
# init NULL subdivisions to <UNDEF>
crimesDF$subdivision[crimesDF$subdivision == "NULL"] <- "<UNDEF>"
# format incident number, report number, lat/lng columns as number
crimesDF[, c(1,2,14:15)] <- lapply(crimesDF[, c(1,2,14:15)], as.numeric)
# format date columns
dateFormat <- "%Y-%m-%dT%H:%M:%S"
dateCols <- names(crimesDF[,grep("_date", names(crimesDF))])
crimesDF[, dateCols] <- lapply(crimesDF[, dateCols]
, function(x) as.POSIXct(strptime(as.character(x)
, format=dateFormat)
, format=dateFormat))
# define factor variables
crimesDF[c(4)] <- factor(unlist(crimesDF[[c(4)]]))
crimesDF[c(6)] <- factor(unlist(crimesDF[[c(6)]]))
#### Column names
names(crimesDF)
## [1] "incident_num" "report_num" "call_type"
## [4] "zone" "call_disposition" "priority"
## [7] "subdivision" "call_date" "entry_date"
## [10] "dispatch_date" "enroute_date" "onscene_date"
## [13] "close_date" "lat" "lng"
#### Basic structure
glimpse(crimesDF)
## Observations: 976,699
## Variables: 15
## $ incident_num <dbl> 151590600, 160250724, 170130295, 162170127, 1...
## $ report_num <dbl> -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -...
## $ call_type <list> ["INFORMATION", "TRAFFIC STOP", "ASSIST CITI...
## $ zone <fctr> 427, 126, 328, 122, 125, 324, 321, 421, 225,...
## $ call_disposition <list> ["Miscellaneous/Other", "Miscellaneous/Other...
## $ priority <fctr> 4, 3, 3, 3, 3, 3, 3, 4, 3, 3, 3, 4, 3, 3, 4,...
## $ subdivision <list> ["WINDSOR OAKS WEST", "CARRIAGE HILL/BROOKWO...
## $ call_date <dttm> 2015-06-08 16:06:23, 2016-01-25 21:44:36, 20...
## $ entry_date <dttm> 2015-06-08 16:06:23, 2016-01-25 21:44:36, 20...
## $ dispatch_date <dttm> 2015-06-08 16:06:23, 2016-01-25 21:44:36, 20...
## $ enroute_date <dttm> 2015-06-08 16:06:23, 2016-01-25 21:44:36, 20...
## $ onscene_date <dttm> 2015-06-08 16:06:23, 2016-01-25 21:44:36, 20...
## $ close_date <dttm> 2015-06-08 16:52:54, 2016-01-25 21:55:48, 20...
## $ lat <dbl> 36.82396, 36.83911, 36.86787, 36.73692, NA, 3...
## $ lng <dbl> -76.10523, -76.06816, -76.13274, -76.02077, N...
set.seed(29)
#### random data sample
crimesDF %>% sample_n(5)
## incident_num report_num call_type zone
## 97359 153450886 -1 OFFICER INITIATED INVESTIGATION 223
## 235289 171730057 -1 TRAFFIC STOP 420
## 100809 161670505 -1 CELLULAR 911 HANG-UP CALL 427
## 317997 170831022 -1 TRAFFIC STOP 228
## 571356 152780216 15041418 LARCENY 327
## call_disposition priority subdivision
## 97359 No Report 5 HILLTOP RIDGE APTS
## 235289 Miscellaneous/Other 3 COLLEGE PARK SQUARE SHPG CTR
## 100809 No Report 3 BEHIND WINDSOR WOODS IN WINDSOR WOODS
## 317997 Miscellaneous/Other 3 SEATACK
## 571356 Offense Report 4 BAYSIDE HIGH SCHOOL
## call_date entry_date dispatch_date
## 97359 2015-12-11 22:10:28 2015-12-11 22:10:28 2015-12-11 22:10:28
## 235289 2017-06-22 01:41:40 2017-06-22 01:41:40 2017-06-22 01:41:40
## 100809 2016-06-15 14:54:21 2016-06-15 14:54:57 2016-06-15 14:55:43
## 317997 2017-03-24 22:42:30 2017-03-24 22:42:30 2017-03-24 22:42:30
## 571356 2015-10-05 10:10:26 2015-10-05 10:10:26 2015-10-05 10:10:26
## enroute_date onscene_date close_date
## 97359 2015-12-11 22:10:28 2015-12-11 22:10:28 2015-12-11 23:59:27
## 235289 2017-06-22 01:41:40 2017-06-22 01:41:40 2017-06-22 01:49:40
## 100809 1899-12-30 00:00:00 1899-12-30 00:00:00 2016-06-15 14:55:46
## 317997 2017-03-24 22:42:30 2017-03-24 22:42:30 2017-03-24 22:46:56
## 571356 2015-10-05 10:10:26 2015-10-05 10:10:26 2015-10-05 10:10:42
## lat lng
## 97359 36.84947 -76.01617
## 235289 NA NA
## 100809 36.83401 -76.09659
## 317997 36.81512 -75.99707
## 571356 36.87003 -76.14763
Use the Nominatim API to extract extra location information. Other API’s are available. But this one provides detailed neighborhood information, ZIP codes, and was affordable (FREE).
###################################################
# Helper functions to format OpenStreetMap API
# request.
###################################################
formatReq <- function(lat, lon) {
paste0(
"lat=",
lat,
"&lon=",
lon,
"&zoom=18&addressdetails=1"
, apiEmail
)
}
getLocationDetail <- function(lat, lon) {
URL <- "http://nominatim.openstreetmap.org/reverse?format=json&"
#response <- GET(paste0(URL, formatReq(lat, lon)))
response <- http_error(GET(paste0(URL, formatReq(lat, lon))))
if (!response) {
json_data <- jsonlite::fromJSON(paste0(URL, formatReq(lat, lon)), simplifyVector = FALSE)
Sys.sleep(2) # added to alleviate [to a degreee] timeout problems
return (json_data)
}
}
A local Postgres database was used to cache location information from Nominatim OpenStreetMap. This reduces the number of Nominatim API calls.
The Postgres table to store this information is defined as follows:
CREATE TABLE location_info (
ID serial NOT NULL PRIMARY KEY,
lat NUMERIC(8,6) NOT NULL,
lng NUMERIC(9,6) NOT NULL,
info json NOT NULL
);
ALTER TABLE public.location_info
ADD CONSTRAINT location_info_uk1 UNIQUE (lat, lng);
Extract location information from the Postgres database.
options(java.parameters = "-Xmx8048m")
drv = JDBC("org.postgresql.Driver", "../../lib/postgresql-9.4.1212.jar")
con <- dbConnect(drv
, url="jdbc:postgresql://localhost:5432/db"
, user="postgres"
, password="docker")
getLocInfoFromDB<-function(lat, lng) {
# When the JSON result is parsed, the returned value is formatted as: ["return_value"]. Remove this
# format before using.
if (complete.cases(lat, lng)) {
results <- dbGetQuery(con
, paste0("SELECT regexp_replace(info -> 'address' ->> 'hamlet'"
, ", '[\"\\[\\]]', '', 'g') AS neighborhood "
, ", regexp_replace(info -> 'address' ->> 'postcode'"
, ", '[\"\\[\\]]', '', 'g') AS zip "
, "FROM location_info "
, "WHERE lat=", lat
, " AND lng=", lng))
} else {
results <- "|"
}
return(results)
}
#### number of rows with no locations
nrow(crimesDF %>% filter(is.na(lat) | is.na(lng)))
## [1] 148883
#### sample of the rows with no lat/lng
crimesDF %>%
select(subdivision, lat, lng) %>%
filter(is.na(lat) | is.na(lng)) %>%
sample_n(10)
## subdivision lat lng
## 13860 PINE HURST ESTATES NA NA
## 123300 COLLEGE PARK NA NA
## 128977 FIRST LANDING STATE PARK AREA NA NA
## 17865 BAYSIDE BOROUGH NA NA
## 34711 <UNDEF> NA NA
## 146287 BAYLAKE PINES AREA NA NA
## 58524 OCEANFRONT - 31ST ST SOUTH NA NA
## 44818 THOROUGHGOOD NA NA
## 94010 PRINCESS ANNE PLAZA NA NA
## 26228 OCEANFRONT - 31ST ST NORTH NA NA
# add location info [from database] to crimes data frame
crimesDF.zip <- left_join(crimesDF.loc, distinct_locs, by = c("lat","lng"))
# number of remaining records to process
crimesDF.zip %>%
select(lat, lng, zip, neighborhood, subdivision) %>%
filter(grepl("^234", zip)) %>%
count()
## # A tibble: 1 x 1
## n
## <int>
## 1 771969
# number of non Virginia Beach zip codes (invalid lat/lng)
crimesDF.zip %>%
select(lat, lng, zip, neighborhood, subdivision) %>%
filter(!grepl("^234", zip)) %>%
distinct() %>%
count()
## # A tibble: 1 x 1
## n
## <int>
## 1 55847
# sample of non Virginia Beach zip codes
crimesDF.zip %>%
select(lat, lng, zip, neighborhood, subdivision) %>%
filter(!grepl("^234", zip)) %>%
sample_n(10)
## lat lng zip neighborhood
## 46209 36.78291 -75.96000 character(0) character(0)
## 37000 36.85538 -76.02489 character(0) character(0)
## 20223 36.76113 -76.00533 character(0) character(0)
## 48403 40.27180 -75.22031 18914 NA
## 20251 36.80352 -76.20816 character(0) character(0)
## 37185 39.94620 -82.98280 43206 NA
## 34528 36.82464 -76.14902 character(0) character(0)
## 54326 26.11609 -97.64632 78586 NA
## 54852 35.81262 -78.78464 27513 NA
## 34714 36.78025 -75.98186 character(0) character(0)
## subdivision
## 46209 DAM NECK
## 37000 OCEANA
## 20223 STRAWBRIDGE OFFICE PARK IN STRAWBRIDGE
## 48403 BELLAMY PLANTATION
## 20251 LEVEL GREEN
## 37185 ARAGONA VILLAGE
## 34528 KEMPSVILLE LAKES
## 54326 <UNDEF>
## 54852 <UNDEF>
## 34714 PINE MEADOWS
# sample of the types of call by ZIP code
crimesDF.zip %>%
select(zip, neighborhood, subdivision, call_type, call_disposition) %>%
filter(grepl("^234", zip)) %>%
sample_n(10) %>%
arrange(zip)
## zip neighborhood subdivision
## 1 23451 NA OCEANFRONT - 31ST ST SOUTH
## 2 23451 NA OCEANFRONT - 31ST ST SOUTH
## 3 23451 NA HILLTOP
## 4 23454 Southern Points WOLFSNARE PLANTATION
## 5 23454 Oceana OCEANA GARDENS
## 6 23455 Smith Lake Terrace LAKE SMITH TERRACE WEST
## 7 23456 Colony Acres MUNICIPAL CENTER
## 8 23456 Hills Corner HERON RIDGE ESTATES
## 9 23462 Pembroke Manor BAYSIDE BOROUGH
## 10 23464 Lakeville Estates WOODSTOCK ELEMENTARY SCHOOL
## call_type call_disposition
## 1 DISABLED VEHICLE, WRECKER REQUIRED Miscellaneous/Other
## 2 OFFICER INITIATED INVESTIGATION No Report
## 3 TRAFFIC STOP Miscellaneous/Other
## 4 ASSIST CITIZEN Miscellaneous/Other
## 5 DISPUTE - BOYFRIEND/GIRL FRIEND No Report
## 6 ANIMAL TRAP Miscellaneous/Other
## 7 WARRANT TRANSFER Arrest Report
## 8 ASSIST FIRE DEPARTMENT Miscellaneous/Other
## 9 BURGLAR ALARM No Report
## 10 CELLULAR 911 HANG-UP CALL Case Cancelled
# call disposition breakdown
crimesDF.zip %>%
select(incident_num, call_disposition) %>%
count(unlist(call_disposition))
## # A tibble: 15 x 2
## `unlist(call_disposition)` n
## <chr> <int>
## 1 <UNDEF> 34
## 2 Accident Report Made 10093
## 3 Animal Bite Report 3772
## 4 Arrest Report 26212
## 5 Burglar Alarm Report 37672
## 6 Case Cancelled 100682
## 7 Duplicate Call 12876
## 8 Miscellaneous/Other 345991
## 9 No Report 184347
## 10 Offense Report 75450
## 11 Picked Up Animal 17901
## 12 Public Utilities 2208
## 13 Public Works 1261
## 14 Unable to Locate 9245
## 15 Unknown/Uncoded 72
# number of 911 calls
crimesDF.zip %>%
select(call_type) %>%
filter(grepl("911", call_type)) %>%
count(unlist(call_type))
## # A tibble: 2 x 2
## `unlist(call_type)` n
## <chr> <int>
## 1 911 HANG UP CALL 16394
## 2 CELLULAR 911 HANG-UP CALL 110409
# distribution of Virginia Beach ZIP codes - begin with 234
ggplot(crimesDF.zip %>%
select(zip) %>%
filter(grepl("^234", zip))
, aes(zip)) +
geom_bar() +
theme(axis.text.x = element_text(angle=90, vjust=0.5, hjust=0)) +
labs(title = "Virginia Beach Crime Incidents By Zip Code", x = "Zip Code", y = "Number of Incidents")
# distribution of crime types
ggplot(crimesDF.zip %>%
select(priority, zip) %>%
filter(grepl("^234", zip)) %>%
select(priority)
, aes(priority)) +
geom_bar() +
labs(title = "Virginia Beach Crime Incidents By Priority", x = "Priority", y = "Number of Incidents")
# heat map illustrating types of crimes by year and qtr
# call disposition breakdown (by year and qtr)
crimes.yrmo <- crimesDF.zip %>%
transmute(incident_num
, call_disposition
, date_entered=paste0(year(entry_date)
, '-'
, sprintf("%02d", month(entry_date)))) %>%
count(disposition=unlist(call_disposition), date_entered)
ggplot(crimes.yrmo, aes(x=date_entered, y=disposition)) +
geom_raster(aes(fill=n), hjust=0.5, vjust=0.5, interpolate=TRUE) +
scale_fill_gradient(name="Number Of Incidents", low = 'yellow', high='red') +
theme(axis.text.x = element_text(angle=90, vjust=0.5, hjust=0)) +
labs(title="Crime Incidents", x="Entry Date", y="Disposition")
ggplot(crimes.yrmo, aes(x=date_entered, y=n, fill=disposition)) +
geom_bar(stat="identity") +
#scale_y_continuous(labels=comma) +
theme(axis.text.x = element_text(angle=90, vjust=0.5, hjust=0)) +
guides(fill=guide_legend(title="Dispositions")) +
labs(title="Crime Incidents", x="Entry Date", y="Number of Incidents")
crimes.zone <- crimesDF %>%
transmute(zone
, priority
, entry_date
, entry_mo=month(entry_date)
, entry_yr=year(entry_date)
, entry_qtr=quarter(entry_date, with_year=TRUE)) %>%
group_by(zone, priority, entry_mo, entry_yr, entry_qtr) %>%
count()
ggplot(data=crimes.zone %>%
select(priority, n, entry_qtr)
, aes(x=entry_qtr, y=n, color=zone)) +
# geom_violin(trim = FALSE) +
stat_summary(fun.y = sum, geom = "line") + # avg line
theme(axis.text.x = element_text(angle=90, vjust=0.5, hjust=0)) +
guides(color=guide_legend(title="Incidents By Zone")) +
labs(title="Quarterly Incidents By Zone", x="Quarter", y="Number Of Incidents") +
facet_wrap(~zone)
rm (crimes.yrmo, crimes.zone)
Let’s explore police response time. Calculate response time as onscene_date-dispatch_date.
crimesDF.response <- crimesDF %>%
transmute(priority
, call_type
, response_time=onscene_date-dispatch_date
, entry_mo=month(entry_date)
, entry_yr=year(entry_date)
, entry_qtr=quarter(entry_date, with_year=TRUE)) %>%
filter(response_time > 0)
ggplot(data=crimesDF.response %>%
select(priority, response_time, entry_qtr)
, aes(x=paste(entry_qtr), y=as.numeric(response_time), color=priority)) +
scale_y_log10() +
geom_violin(trim = FALSE) +
geom_boxplot(width=0.1) +
theme(axis.text.x = element_text(angle=90, vjust=0.5, hjust=0)) +
guides(color=guide_legend(title="Call Priority")) +
labs(title="Response Time By Call Priority"
, x="Call Priority"
, y="Response Time (sec)") +
facet_wrap(~priority)
## Warning in max(data$density): no non-missing arguments to max; returning -
## Inf
Group the number of incidents depending upon the map’s zoom level. The number inside each circle represents the total number of incidents in that area. When you click on a cluster, the map will automatically zoom into that area and split into smaller clusters or show the individual incidents depending on how zoomed in you are. Small circles represent individual incident reports.
Display a 2017 incident map by zip code. Represent the Virginia Beach Police Stations by blue markers. Display zip code, zone, priority, call type and location information if available.
crimeGrp <- full_join(crimesDF.loc, distinct_locs, by = c("lat","lng")) %>%
select(lat, lng, priority, call_type, neighborhood, subdivision, zip, zone, entry_date) %>%
filter(year(entry_date) == 2017) %>%
group_by(lat
, lng
, priority
, unlist(call_type)
, neighborhood
, unlist(subdivision)
, zip
, zone) %>%
count()
colnames(crimeGrp) <- c("lat"
, "lng"
, "priority"
, "call_type"
, "neighborhood"
, "subdivision"
, "zip"
, "zone"
, "cnt")
# zip code overlay gone 11/2017
m.zip <- leaflet(crimeGrp) %>%
addTiles() %>%
# addGeoJSON(zipcodes) %>% # zip code overlay
addMarkers(lng=vbPrecincts$lng, lat=vbPrecincts$lat, popup=vbPrecincts$descr) %>%
addCircleMarkers(clusterOptions = markerClusterOptions()
, popup=paste("<b>("
, crimeGrp$zip
, "-"
, crimeGrp$zone
, "-"
, crimeGrp$priority
, ")</b> "
, "-"
, crimeGrp$call_type
, "<br/>"
, crimeGrp$neighborhood
, "/"
, crimeGrp$subdivision)) %>%
setView(-76.1339487, 36.8464062, 11)
m.zip # Print the map