Summary

Specialisation: Data Science at Scale
Course: Communicating Data Science Results
Education Institution: University of Washington
Publisher: Coursera
Assignment: Crime Analytics: Visualisation of Incident Reports

Reproducibility - R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see rmarkdown.rstudio.com.

Using the knitr package, a document is generated that includes both the content and the output of any embedded R code chunks. More about knitr at yihui.name/knitr/.

This document is published on RPubs, a service for publishing R Markdown documents on the web. Prerequisites are R itself, RStudio (v0.96.230 or later), and the knitr package (v0.5 or later). More details on RPubs at rpubs.com.

Introduction

Analysis

The investigation will be an initial Exploratory Data Analysis focusing on visualisation. It will follow a What, When, Where approach.

Data

The data set used in this report contains the Police Incident Reports of the city of San Francisco for the summer of 2014. The data set is provided by Coursera.

Code

This report was produced using R and the R packages stringi, reshape2, ggplot2, ggthemes and ggmap.

Although the R code is intermixed with the report so that the calculations and data/variables are available, the code listing is held until the end (Appendix A) to avoid cluttering the main part of the report with a mix of text, charts and code.

Exploratory Data Analysis

Loading the data

What

The initial question is: What is happening?

The variables that can help answer this question are Category, Descript and Resolution.

The variable Category has 34 levels.

The top category of incidents is Larceny/theft with 9,466, followed by Other Offenses (3,567) and Non-criminal (3,023).

A next step would be to break down each of the top categories by their descriptions to understand them in more detail.
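The breakdown suggested above could look roughly like the following. This is only a sketch on made-up toy data, not part of the original analysis: the helper names `top_n_categories` and `breakdown` and the description values are hypothetical, and only base R is used.

```r
# Toy data standing in for the incident table (values are made up for illustration)
incidents <- data.frame(
  Category = c("Larceny/theft", "Larceny/theft", "Larceny/theft",
               "Assault", "Assault", "Warrants"),
  Descript = c("Theft From Auto", "Theft From Auto", "Shoplifting",
               "Battery", "Threats", "Warrant Arrest"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: names of the n most frequent categories
top_n_categories <- function(df, n) {
  counts <- sort(table(df$Category), decreasing = TRUE)
  names(counts)[seq_len(min(n, length(counts)))]
}

# Hypothetical helper: count descriptions within each of the top n categories
breakdown <- function(df, n = 2) {
  top <- top_n_categories(df, n)
  sub <- df[df$Category %in% top, ]
  out <- as.data.frame(table(Category = sub$Category, Descript = sub$Descript),
                       stringsAsFactors = FALSE)
  out <- out[out$Freq > 0, ]              # drop combinations that never occur
  out[order(out$Category, -out$Freq), ]   # most frequent description first
}

result <- breakdown(incidents, n = 2)
print(result)
```

On the real data, the same idea would be applied to sf$Category and sf$Descript after the factor conversion shown in Appendix A.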

The variable Resolution has 16 levels.

The top resolution of incidents is None with 19,139, followed by Arrest, Booked (6,502) and Arrest, Cited (1,419).

When

Another question is: When is it happening?

The variables that can help answer this question are DayOfWeek, Date and Time.

Crossing the Day of the Week with the Hour helps to identify the most typical hours at which incidents happen.
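As an illustration of this crossing, the sketch below builds a day-by-hour table of counts with base R on a few made-up rows. It assumes Time is a string of the form "HH:MM"; the report itself uses stringi and reshape2 for the same purpose (see Appendix A).

```r
# Toy data (made up); Time is assumed to be an "HH:MM" string
incidents <- data.frame(
  DayOfWeek = c("Friday", "Friday", "Monday", "Friday", "Sunday"),
  Time      = c("23:10", "23:45", "12:00", "18:30", "23:05"),
  stringsAsFactors = FALSE
)

# Keep the days in calendar order rather than alphabetical order
days <- c("Monday", "Tuesday", "Wednesday", "Thursday",
          "Friday", "Saturday", "Sunday")
incidents$DayOfWeek <- factor(incidents$DayOfWeek, levels = days)

# The hour is the first two characters of Time; keep all 24 levels
incidents$Hour <- factor(substr(incidents$Time, 1, 2),
                         levels = sprintf("%02d", 0:23))

# Cross-tabulate: one row per day, one column per hour
counts <- table(DayOfWeek = incidents$DayOfWeek, Hour = incidents$Hour)

# Long format (one row per day/hour pair), keeping only non-empty cells
long <- as.data.frame(counts)
long <- long[long$Freq > 0, ]
print(long)
```

The long format is the shape that a point chart with size and colour mapped to the frequency expects.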

During the week, incidents occur mostly in the early evening (17:00 to 18:00, stretching to 20:00), with a second concentration around 12:00. From Friday through the weekend, incidents are concentrated in the evening and again late at night, around 23:00.

Further analysis could investigate the incidents by category or resolution, as well as by location.

Where

Finally, the question becomes: Where is it happening?

The variables that can help answer this question are PdDistrict, Address, X, Y and Location.

As the geographical coordinates are available, it is possible to identify the location of each incident precisely and relate it to categories and periods. Larceny/theft and, to a lesser extent, Warrants are concentrated in the area near the bay, while Vehicle Theft is spread across the city.

Assault  Drug/narcotic  Larceny/theft  Vehicle Theft  Warrants
   2882           1345           9466           1966      1782

When filtering for Larceny/theft only and breaking it down by Day of the Week, there is a concentration in the area near the bay and on Fridays and weekends.

Monday  Tuesday  Wednesday  Thursday  Friday  Saturday  Sunday
  1233     1237       1228      1235    1445      1583    1505
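To put a number on the Friday and weekend concentration, the counts in the table above can be turned into shares. The sketch below only re-uses those seven counts; the variable names are illustrative. Friday to Sunday account for 4,533 of the 9,466 Larceny/theft incidents (about 48%), or 1,511 per day on average against about 1,233 per day from Monday to Thursday.

```r
# Larceny/theft incidents by day of week, copied from the table above
larceny <- c(Monday = 1233, Tuesday = 1237, Wednesday = 1228, Thursday = 1235,
             Friday = 1445, Saturday = 1583, Sunday = 1505)

total <- sum(larceny)

# Share of each day, in percent of all Larceny/theft incidents
share <- round(100 * larceny / total, 1)

# Friday to Sunday against Monday to Thursday
late_week <- c("Friday", "Saturday", "Sunday")
early_week <- c("Monday", "Tuesday", "Wednesday", "Thursday")

late_week_share <- sum(larceny[late_week]) / total
mean_fri_sun    <- mean(larceny[late_week])
mean_mon_thu    <- mean(larceny[early_week])

print(share)
print(round(c(late_week_share = late_week_share,
              mean_fri_sun    = mean_fri_sun,
              mean_mon_thu    = mean_mon_thu), 3))
```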

Appendix A

Code

# load packages
require(stringi)  # Character String Processing Facilities
require(reshape2) # Flexibly Reshape Data
require(ggplot2)  # An Implementation of the Grammar of Graphics
require(ggthemes) # Extra Themes, Scales and Geoms for ggplot2
require(ggmap)    # Spatial Visualization with ggplot2
require(knitr)    # provides kable(), used for the tables below

# Functions
# - Capitalise
capwords <- function(s, strict = FALSE) {
  cap <- function(s) paste(toupper(substring(s, 1, 1)),
                           {s <- substring(s, 2); if (strict) tolower(s) else s},
                           sep      = "", 
                           collapse = " ")
  sapply(strsplit(s, split = " "), cap, USE.NAMES = !is.null(names(s)))
}

# Load San Francisco's data set
sf <- read.csv(file             = "sanfrancisco_incidents_summer_2014.csv",
               stringsAsFactors = FALSE)

# convert Category to factors (Nominal) variable
sf$Category   <- factor(capwords(tolower(sf$Category)))

# reduce Category to a table of frequencies
Data           <- as.data.frame(table(sf$Category), stringsAsFactors = FALSE)
# change columns' names
colnames(Data) <- c("Category", "Frequency")
# order by decreasing frequency
Data           <- Data[order(Data$Frequency, decreasing = TRUE), ]
# force the new order as factor, necessary to force the order in the chart
Data$Category  <- factor(Data$Category, levels = Data$Category)
# filter to the top 15
Data           <- Data[1:15, ]

# create the chart for category 
g <- ggplot(Data)
g <- g + geom_bar(aes(x = Category, y = Frequency, fill = Frequency), stat = "identity")
g <- g + ggtitle(expression(atop("What is Happening", atop(italic("Top 15 Category of Incidents (Summer 2014)"), ""))))
g <- g + scale_fill_continuous(low = "orange", high = "red")
g <- g + theme(axis.text.x = element_text(angle = 45, hjust = 1))
g

# convert Resolution to factors (Nominal) variable
sf$Resolution <- factor(capwords(tolower(sf$Resolution)))

# reduce Resolution to a table of frequencies
Data            <- as.data.frame(table(sf$Resolution), stringsAsFactors = FALSE)
# change columns' names
colnames(Data)  <- c("Resolution", "Frequency")
# order by decreasing frequency
Data            <- Data[order(Data$Frequency, decreasing = TRUE), ]
# force the new order as factor, necessary to force the order in the chart
Data$Resolution <- factor(Data$Resolution, levels = Data$Resolution)
# filter to the top 5
Data            <- Data[1:5, ]

# create the chart for resolution
g <- ggplot(Data)
g <- g + geom_bar(aes(x = Resolution, y = Frequency, fill = Frequency), stat = "identity")
g <- g + ggtitle(expression(atop("What is Happening", atop(italic("Top 5 Resolutions of Incidents (Summer 2014)"), ""))))
g <- g + scale_fill_continuous(low = "orange", high = "red")
g

# convert DayOfWeek to factors (Nominal) variable and force calendar order
sf$DayOfWeek <- factor(sf$DayOfWeek, 
                       levels = c("Monday", "Tuesday",  "Wednesday", "Thursday", 
                                  "Friday", "Saturday", "Sunday"))
sf$Hour      <- stri_replace_all_regex(str         = sf$Time,
                                       pattern     = "([0-2][0-9]).*",
                                       replacement = "$1")
sf$Hour      <- factor(sf$Hour)

# reduce DayOfWeek and Hour to a table of frequencies (count the rows per cell)
Data            <- dcast(sf, DayOfWeek + Hour ~ ., fun.aggregate = length)
# change columns' names
colnames(Data)  <- c("DayOfWeek", "Hour", "Frequency")

# create the chart for day of the week by hour
g <- ggplot(Data)
g <- g + geom_point(aes(x = DayOfWeek, y = Hour, size = Frequency, colour = Frequency), stat = "identity")
g <- g + ggtitle(expression(atop("When it is Happening", atop(italic("Hour by Day of the Week of Incidents (Summer 2014)"), ""))))
g <- g + scale_colour_continuous(low = "yellow", high = "red")
g

sf$Latitude   <- sf$Y
sf$Longitude  <- sf$X

# filter top 5 relevant categories
sf_sub          <- subset(sf, Category %in% c("Larceny/theft", "Assault", 
                                              "Vehicle Theft", "Warrants", 
                                              "Drug/narcotic"))
sf_sub$Category <- factor(sf_sub$Category)

# reduce Location to a table of frequencies (count the rows per cell)
Data            <- dcast(sf_sub, Latitude + Longitude + Category ~ ., fun.aggregate = length)
colnames(Data)  <- c("Latitude", "Longitude", "Category", "Frequency")

# create the chart for Location / Categories
g <- qmplot(Longitude, Latitude, data = Data, color = Category, size = I(1.5),
            maptype = "toner-lite")
g <- g + scale_colour_brewer(type = "qual", palette = "Accent")
g <- g + ggtitle(expression(atop("Where it is Happening", atop(italic("Location by Top Categories of Incidents (Summer 2014)"), ""))))
g

kable(t(table(sf_sub$Category)))

# filter top category
sf_sub          <- subset(sf, Category == "Larceny/theft")

# reduce Location to a table of frequencies (count the rows per cell)
Data            <- dcast(sf_sub, Latitude + Longitude + DayOfWeek ~ ., fun.aggregate = length)
colnames(Data)  <- c("Latitude", "Longitude", "DayOfWeek", "Frequency")

# create the chart for Location / Day of the Week
g <- qmplot(Longitude, Latitude, data = Data, 
            color = DayOfWeek, size = I(1.5), maptype = "toner-lite")
g <- g + scale_colour_brewer(type = "div", palette = "BrBG")
g <- g + ggtitle(expression(atop("Where it is Happening: Larceny/Theft", atop(italic("Location by Day of Week of Incidents (Summer 2014)"), ""))))
g

kable(t(table(sf_sub$DayOfWeek)))

Session Information

## R version 3.2.2 (2015-08-14)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.11.1 (El Capitan)
## 
## locale:
## [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] mapproj_1.2-4  maps_3.0.0-2   ggmap_2.5.2    ggthemes_2.2.1
## [5] ggplot2_1.0.1  reshape2_1.4.1 stringi_1.0-1  knitr_1.11    
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.2         magrittr_1.5        MASS_7.3-45        
##  [4] munsell_0.4.2       lattice_0.20-33     geosphere_1.4-3    
##  [7] colorspace_1.2-6    rjson_0.2.15        jpeg_0.1-8         
## [10] highr_0.5.1         stringr_1.0.0       plyr_1.8.3         
## [13] tools_3.2.2         grid_3.2.2          gtable_0.1.2       
## [16] png_0.1-7           htmltools_0.2.6     yaml_2.1.13        
## [19] digest_0.6.8        RJSONIO_1.3-0       RColorBrewer_1.1-2 
## [22] formatR_1.2.1       evaluate_0.8        rmarkdown_0.8.1    
## [25] labeling_0.3        sp_1.2-1            RgoogleMaps_1.2.0.7
## [28] scales_0.3.0        proto_0.3-10