This assignment uses R to explore what data visualization can tell us about the patterns of crime in San Francisco in the Summer of 2014. It is only feasible to look at a few key dimensions, so the ones that have been chosen are the time of day, type of crime (broadly categorised), and spatial distribution of incidents. The exercise reveals some possible problems with data quality, so only tentative conclusions can be drawn.

Manipulating the data

The data provided comprised a .csv file of incidents occurring during June, July and August of 2014. Each record contained details including the type, date and time of occurrence, the Police District in which it took place, and the location of the event. This data was read into a dataframe called ‘Crimes’, and some basic pre-processing was carried out (e.g. to format the dates and times).
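
The pre-processing code itself is not shown in this report. As a rough sketch of what such a step might look like, the example below reads a few made-up rows from inline text (standing in for the real .csv file) and derives hour and minute columns. The sample values, the month/day/year date format and the `Minute` column are assumptions for illustration, not the code actually used here.

```r
# Illustrative only: three made-up rows standing in for the real .csv file.
sample_csv <- "IncidntNum,Category,Descript,Date,Time,PdDistrict,X,Y
140000001,ASSAULT,BATTERY,06/01/2014,23:50,SOUTHERN,-122.40,37.77
140000002,LARCENY/THEFT,PETTY THEFT,07/15/2014,12:00,CENTRAL,-122.41,37.79
140000003,ARSON,ARSON,08/31/2014,00:05,MISSION,-122.42,37.76"

Crimes <- read.csv(text = sample_csv, stringsAsFactors = FALSE)

# Assumed date format (month/day/year); adjust to match the actual file.
Crimes$Date <- as.Date(Crimes$Date, format = "%m/%d/%Y")

# Split "HH:MM" into integer hour and minute columns.
time_parts <- strsplit(Crimes$Time, ":", fixed = TRUE)
Crimes$Hour   <- as.integer(sapply(time_parts, `[`, 1))
Crimes$Minute <- as.integer(sapply(time_parts, `[`, 2))
```

With the real file, the `text =` argument would be replaced by the path to the .csv.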

In all, there were 34 categories of incident, some of them with very few occurrences:

head(aggregate(Crimes["IncidntNum"],by=Crimes[c("Category")],FUN=length),5)
##             Category IncidntNum
## 1              ARSON         63
## 2            ASSAULT       2882
## 3            BRIBERY          1
## 4           BURGLARY          6
## 5 DISORDERLY CONDUCT         31

To keep the analysis simple, all categories with fewer than 1000 incidents were combined with the ‘Other Offenses’ category (which was then renamed ‘All Other Offenses’). This resulted in nine new ‘Broad Categories’, which is a more manageable number.
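
The recoding described above could be done in several ways. One possible sketch is shown below, using a hypothetical helper `make_broad_categories()` and a toy vector with a threshold of 3 rather than 1000. The helper name, and the assumption that the catch-all category is spelled "OTHER OFFENSES" in the raw data, are illustrative rather than taken from the original code.

```r
# Hypothetical helper: fold rare categories into one catch-all level.
make_broad_categories <- function(category, threshold = 1000,
                                  other_label = "All Other Offenses") {
  category <- as.character(category)
  counts <- table(category)
  # Keep categories with at least `threshold` records, except the original
  # catch-all (assumed spelling), which is merged with the rare categories.
  keep <- setdiff(names(counts)[counts >= threshold], "OTHER OFFENSES")
  broad <- ifelse(category %in% keep, category, other_label)
  factor(broad, levels = c(sort(keep), other_label))
}

# Toy example with a threshold of 3 instead of 1000.
toy <- c(rep("ASSAULT", 4), rep("LARCENY/THEFT", 5),
         "BRIBERY", rep("OTHER OFFENSES", 3))
table(make_broad_categories(toy, threshold = 3))
```

Returning a factor with the catch-all level last keeps it at one end of the legend when the result is used as a fill variable.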

Caveat

One thing worth noting at this stage is that the file quite often contains two or more records with the same ‘IncidntNum’. These multiple occurrences always happen at exactly the same place and time, but the type and description of the crime may be quite different. Presumably this is because one incident may involve (say) a narcotics offense and an assault. Throughout this analysis, each record has been counted once, but that may mean that the total number of incidents is exaggerated.
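
To gauge how much double counting might be involved, the number of records can be compared with the number of distinct incident numbers. The sketch below does this on a small made-up data frame; only the column names `IncidntNum` and `Category` come from the file, the values are invented.

```r
# Toy data: incidents 101 and 103 appear more than once (made-up values).
toy <- data.frame(
  IncidntNum = c(101, 101, 102, 103, 103, 103),
  Category   = c("DRUG/NARCOTIC", "ASSAULT", "ARSON",
                 "ASSAULT", "WARRANTS", "ROBBERY"),
  stringsAsFactors = FALSE
)

n_records   <- nrow(toy)
n_incidents <- length(unique(toy$IncidntNum))

# Records per incident number, keeping only those that occur more than once.
per_incident <- table(toy$IncidntNum)
repeated <- per_incident[per_incident > 1]

cat("Records:", n_records, " Distinct incidents:", n_incidents, "\n")
print(repeated)
```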

Daily pattern of incidents

The histogram below shows all the records for the entire 3-month period, broken down by the time of day and broad category of incident.

# Stacked histogram for whole city
library(ggplot2)
library(grid)  # provides unit() and arrow()

ggplot(Crimes, aes(x = Hour + 0.5, fill = BroadCat)) +  # add 0.5 to centre each hour in its one-hour bin, so that the bar from 23 to 24 hours gets drawn
  geom_histogram(breaks = seq(0, 24, by = 1), position = "stack") +
  theme(panel.background = element_blank(),
        panel.margin = unit(0, "lines"),  # named 'panel.spacing' from ggplot2 2.2.0 onwards
        legend.title = element_text(face = "bold", size = rel(1.2)),
        axis.text.x = element_text(face = "bold", size = rel(1.2)),
        axis.text.y = element_text(face = "bold", size = rel(1.2)),
        axis.title.x = element_text(face = "bold", size = rel(1.2)),
        axis.title.y = element_text(face = "bold", size = rel(1.2))) +
  labs(x = "Hour of Day", y = "Number of Incidents\n") +
  scale_x_continuous(breaks = c(0, 4, 8, 12, 16, 20, 24),
                     labels = c("00:00", "04:00", "08:00", "12:00", "16:00", "20:00", "24:00"),
                     expand = c(0, 0)) +  # 'expand' prevents gap between vertical axis labels and histogram
  scale_fill_brewer(palette = "Set1") +
  # Grey bands: darker for full darkness, lighter for dawn and dusk
  annotate("rect", xmin = 0, xmax = 4, ymin = 0, ymax = Inf, alpha = 0.3, fill = "gray") +
  annotate("rect", xmin = 4, xmax = 6, ymin = 0, ymax = Inf, alpha = 0.1, fill = "gray") +
  annotate("rect", xmin = 21, xmax = 23, ymin = 0, ymax = Inf, alpha = 0.1, fill = "gray") +
  annotate("rect", xmin = 23, xmax = 24, ymin = 0, ymax = Inf, alpha = 0.3, fill = "gray") +
  # Double-headed arrow and label marking the approximate daylight hours
  annotate("segment", x = 6, xend = 21, y = 2200, yend = 2200, colour = "grey", size = 2, arrow = arrow(ends = "both")) +
  annotate("text", x = 13.5, y = 2300, label = "Daylight (approx)", colour = "grey", size = 6, fontface = "bold", hjust = 0.5) +
  guides(fill = guide_legend(reverse = TRUE, title = "Broad Category", title.position = "top")) +
  scale_y_continuous(expand = c(0, 0)) +  # 'expand' prevents gap between horizontal axis labels and histogram
  expand_limits(y = 2500)

The grey shading at the right and left is intended as a reminder of which hours are likely to be completely dark during the summer, and which hours are in transition between daylight and darkness (although of course this is only approximate, and will vary over the 3-month period).

It can be seen that by far the biggest single category of incident is ‘Larceny/Theft’ (shown in green). Larceny/Theft also accounts for the greatest amount of variation over the 24-hour period, rising from very low levels in the early hours of the morning, to an early evening peak. There is also a pronounced peak at lunchtime, although the crimes committed then are more varied.

Breakdown by Police District

The ‘ggplot2’ package makes it easy to see how the daily pattern of crime varies between one Police District and another:

ggplot(Crimes, aes(x = Hour + 0.5, fill = BroadCat)) +  # Added 0.5 to ensure that the bar from 23 to 24 hours gets drawn
  geom_histogram(breaks = seq(0, 24, by = 1), position = "stack") +
  facet_wrap(~ PdDistrict, ncol = 4) +
  theme(panel.background = element_blank(),
        legend.title = element_text(face = "bold"),
        strip.text = element_text(face = "bold", size = rel(1.2))) +
  labs(x = "Hour of Day", y = "Number of Incidents") +
  scale_x_continuous(breaks = c(0, 4, 8, 12, 16, 20, 24)) +
  scale_fill_brewer(palette = "Set1") +
  annotate("rect", xmin = 0, xmax = 4, ymin = 0, ymax = Inf, alpha = 0.3, fill = "gray") +
  annotate("rect", xmin = 4, xmax = 6, ymin = 0, ymax = Inf, alpha = 0.1, fill = "gray") +
  annotate("rect", xmin = 21, xmax = 23, ymin = 0, ymax = Inf, alpha = 0.1, fill = "gray") +
  annotate("rect", xmin = 23, xmax = 24, ymin = 0, ymax = Inf, alpha = 0.3, fill = "gray") +
  guides(fill = guide_legend(reverse = TRUE, title = "Broad Category", title.position = "top"))

From this comparison, it is clear that Southern district has the greatest number of incidents overall (and especially of Larceny/Theft). This district also has the most extreme peaks and troughs in the course of the day. The other police districts which come closest to following this pattern are Central and Northern.

Spatial distribution

It is already clear that crime patterns vary a lot from one Police District to another, so it is natural to want to explore this in more detail on a map. The individual incidents can be pinpointed using the X- and Y-coordinates provided in the original file, and the outlines of the (pre-2015) San Francisco Police Districts can be downloaded from the San Francisco Data Portal.

Inspired by an example of crime mapping in London by Henry Partridge, I decided to try mapping the San Francisco crime data in R using the ‘Leaflet’ package. The following code is all it takes to produce the interactive map below:

library(leaflet)
library(RColorBrewer)
library(htmltools)
library(lubridate)  # assumed source of hour() and minute() used in the popup below

library(maptools)  # readShapePoly(); maptools and rgeos have since been retired from CRAN
library(scales)
library(rgeos)

# Outlines of the police districts, from the downloaded shapefile
PDistricts <- readShapePoly("geo_export_14f2cdaa-a8da-45bb-a34c-c6dbd8688433.shp")

# One colour per broad category, using the same palette as the histograms
pal <- colorFactor(palette = "Set1", domain = levels(Crimes$BroadCat))

leaflet(data = Crimes) %>%
  addTiles() %>%
  addCircleMarkers(
    clusterOptions = markerClusterOptions(),
    lng = ~X, lat = ~Y,
    color = ~pal(BroadCat), stroke = FALSE, fillOpacity = 1,
    # Popup: category, description, date and time (HH:MM), separated by line breaks
    popup = ~paste(sep = "<br/>",
                   htmlEscape(BroadCat), htmlEscape(Descript), htmlEscape(Date),
                   htmlEscape(paste0(formatC(hour(Time), width = 2, flag = "0"), ":",
                                     formatC(minute(Time), width = 2, flag = "0"), " hrs")))
  ) %>%
  # District outlines; clicking inside one shows its name
  addPolygons(data = PDistricts, fillColor = "transparent", popup = ~district)

With over 28,000 records, it is necessary to ‘cluster’ the markers, otherwise the map would be very cluttered and very slow. As you zoom in, however, individual markers start to become visible, colored according to the broad category of incident. If you click on one of these markers, a popup window appears containing details of the incident. You can also click on the map background itself, to see a popup window identifying the Police District.

We have already established that Southern district (the one which extends across to an island) has the highest number of incidents in the city. If we zoom into the mainland part of Southern district, we soon notice a very large ‘cluster’ of incidents to the south of the freeway. When fully zoomed in, this cluster is still over 900-strong, and clicking on it again produces a veritable kaleidoscope of crimes, of every type imaginable.

Without local (and operational) knowledge, it is impossible to be sure exactly what is happening here. It is surely no coincidence that this huge cluster of incidents is on the corner of the block containing the Hall of Justice and the County Jail! But is this a real effect? Or have officers got into the habit of using this location as a convenient place to ‘park’ all the crimes whose true location is unknown?
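
One way to follow up on this suspicion without the map is to count how many records share exactly the same coordinates. The sketch below does so on made-up data; the `X` and `Y` column names match the file, but the values and the `Records` column are illustrative assumptions.

```r
# Toy data (made-up coordinates): several records share one exact location.
toy <- data.frame(
  X = c(-122.40, -122.40, -122.40, -122.41, -122.42),
  Y = c(  37.77,   37.77,   37.77,   37.79,   37.76)
)

# Count records per exact coordinate pair and sort, most frequent first.
loc_counts <- aggregate(list(Records = rep(1, nrow(toy))),
                        by = list(X = toy$X, Y = toy$Y), FUN = sum)
loc_counts <- loc_counts[order(-loc_counts$Records), ]
head(loc_counts, 3)
```

A location whose count is far above all others would be a candidate for a default or placeholder address, although only someone with operational knowledge could confirm that.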

Already the map has given rise to extra insights, which would never have become apparent if we had only looked at histograms or similar summary charts.

Data quality

Now that our suspicions have been aroused, it is natural to be a bit more sceptical about the accuracy of other aspects of the data. If you didn’t know the true time of an event, but you had to enter something, what time would you pick? Noon, perhaps?

In the histogram below, the incidents recorded during each hour of the day are split into those logged as occurring exactly on the hour (in red), and all others (in blue). It is striking that no incidents are recorded at 00:00 hours exactly, but that is probably deliberate, as assigning a time of midnight would lead to arguments about the date of the offense.

At other times of the day, a strikingly high proportion of offenses are logged as occurring exactly on the hour. That proportion is highest of all (44%) for incidents in the hour starting at noon. If we take the data at face value, almost all of the extra surge of incidents in that hour is happening right on the stroke of midday. Suddenly, that ‘lunchtime peak’ doesn’t seem quite so convincing after all…
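
The on-the-hour share for each hour can be computed along the following lines. This is a minimal sketch on made-up data, assuming the hour and minute of each record are available as numeric columns named `Hour` and `Minute`; the `Minute` and `OnTheHour` names are hypothetical.

```r
# Toy data (made-up): hour and minute of each record.
toy <- data.frame(
  Hour   = c(12, 12, 12, 12, 13, 13, 13),
  Minute = c( 0,  0, 30,  0, 15,  0, 45)
)

# Flag records logged exactly on the hour.
toy$OnTheHour <- toy$Minute == 0

# Proportion of on-the-hour records within each hour of the day.
share_on_hour <- tapply(toy$OnTheHour, toy$Hour, mean)
round(100 * share_on_hour, 1)
```

The same `OnTheHour` flag could serve as the fill variable in a stacked histogram like the ones shown earlier.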

Conclusions

The exploratory analysis of this data has revealed possible issues regarding data quality, which would have to be investigated before being able to draw serious conclusions about patterns of crime either in space or in time. The exercise has, however, demonstrated the insights which can be gained from interactive mapping.

Once the data issues have been ironed out, it would be worth taking interactive mapping with Leaflet to the next level, by implementing it on RStudio’s ‘Shiny’ platform. This would give the developer more scope to link maps and charts if desired, and provide users with more ways to slice, dice and filter the data ‘on the fly’, and thus obtain a deeper understanding of what is really going on.