Summary
- Reproducibility - R Markdown
Introduction
- Analysis
- Data
  - Links
- Code
Exploratory Data Analysis
- Loading the data
- What
- When
- Where
Appendix A
- Code
- Session Information

Summary

Specialisation	Data Science at Scale
Course	Communicating Data Science Results
Education Institution	University of Washington
Publisher	Coursera
Assignment	Crime Analytics: Visualisation of Incident Reports

Reproducibility - R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see rmarkdown.rstudio.com.

Using the knitr package, a document is generated that includes both the content and the output of any R code chunks embedded in it. More about knitr at yihui.name/knitr/.

This document is published on RPubs, a service for publishing R Markdown documents on the web. Prerequisites are R itself, RStudio (v0.96.230 or later), and the knitr package (v0.5 or later). More details on RPubs at rpubs.com.

Introduction

Analysis

The investigation will be an initial Exploratory Data Analysis focusing on visualisation. It will follow a What, When, Where approach.

Data

The data set used in this report contains Police Incident Reports for the city of San Francisco from the summer of 2014. The data set is provided by Coursera.

Code

This report was produced using R and the R packages stringi, reshape2, ggplot2, ggthemes and ggmap.

Although the R code is intermixed with the report so that the calculations, data and variables are available, the code listing is held until the end (Appendix A) to avoid cluttering the text and charts in the main part of the report.

Exploratory Data Analysis

Loading the data

What

The initial question is: What is happening?

The variables that can help answer this question are Category, Descript and Resolution.

The variable Category has 34 levels.

The top category of incidents is Larceny/theft with 9,466 incidents, followed by Other Offenses (3,567) and Non-criminal (3,023).

A next step would be to break down each of the top categories by its descriptions to understand them in more detail.

The variable Resolution has 16 levels.

The top resolution of incidents is None with 19,139 incidents, followed by Arrest, Booked (6,502) and Arrest, Cited (1,419).
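
As a minimal sketch of how rankings like the ones above can be obtained, the base R snippet below tabulates a small made-up vector and sorts the counts in decreasing order. The vector `category` and its values are illustrative stand-ins for `sf$Category`, not the actual data.

```r
# Hypothetical stand-in for sf$Category (illustrative values only)
category <- c("Larceny/theft", "Assault", "Larceny/theft", "Warrants",
              "Larceny/theft", "Assault")

# Count the occurrences of each level and sort by decreasing frequency
freq <- sort(table(category), decreasing = TRUE)

# Keep the two most frequent levels
top2 <- freq[seq_len(2)]
print(top2)
```

The same pattern applies to Resolution, or to Descript within a single category, by swapping the input vector.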

When

Another question is: When is it happening?

The variables that can help answer this question are DayOfWeek, Date and Time.

Crossing the day of the week with the hour helps to identify the hours when incidents most typically happen.

On weekdays, incidents occur mostly in the early evening (17 to 18 hours, stretching to 20 hours), with a second concentration around 12 hours. From Friday through the weekend, incidents are concentrated in the evening and again late at night, around 23 hours.

Further analysis could break the incidents down by category or resolution, as well as by location.
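
The crossing of day and hour can be sketched in base R as follows. The rows below are invented for illustration, and the sketch assumes that Time is stored as an "HH:MM" string; the appendix uses stringi and reshape2 for the same step.

```r
# Hypothetical sample rows mimicking the DayOfWeek and Time columns
incidents <- data.frame(
  DayOfWeek = c("Friday", "Friday", "Monday", "Saturday", "Friday"),
  Time      = c("23:15", "23:40", "12:05", "23:59", "17:30"),
  stringsAsFactors = FALSE
)

# Assumption: Time is "HH:MM", so the first two characters are the hour
incidents$Hour <- substr(incidents$Time, 1, 2)

# Count incidents for every day/hour combination
counts <- as.data.frame(table(DayOfWeek = incidents$DayOfWeek,
                              Hour      = incidents$Hour),
                        stringsAsFactors = FALSE)

# Drop the combinations that never occur
counts <- counts[counts$Freq > 0, ]
print(counts)
```

Each remaining row of `counts` corresponds to one point in the day-by-hour chart, with `Freq` driving its size and colour.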

Where

Finally, the question becomes: Where is it happening?

The variables that can help answer this question are PdDistrict, Address, X, Y and Location.

As the geographical coordinates are available, it is possible to pinpoint the location of the incidents and relate it to categories and periods. Larceny/theft and, to a lesser extent, Warrants are predominant in the bay area, while Vehicle Theft incidents are spread across the city.

Assault	Drug/narcotic	Larceny/theft	Vehicle Theft	Warrants
2882	1345	9466	1966	1782

When filtering for Larceny/theft only and breaking it down by day of the week, there is a concentration in the bay area and on Fridays and weekends.

Monday	Tuesday	Wednesday	Thursday	Friday	Saturday	Sunday
1233	1237	1228	1235	1445	1583	1505
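
To put a number on the Friday-to-Sunday concentration, the counts in the table above can be turned into shares. This is a small worked sketch using only the figures reported in the table; the variable names are illustrative.

```r
# Larceny/theft incidents per day of the week, copied from the table above
larceny_by_day <- c(Monday = 1233, Tuesday = 1237, Wednesday = 1228,
                    Thursday = 1235, Friday = 1445, Saturday = 1583,
                    Sunday = 1505)

# The daily counts add up to the 9,466 Larceny/theft incidents reported earlier
total <- sum(larceny_by_day)

# Share of incidents that fall on Friday, Saturday or Sunday
late_week       <- c("Friday", "Saturday", "Sunday")
late_week_share <- sum(larceny_by_day[late_week]) / total

# Average daily count for Monday-Thursday versus Friday-Sunday
mean_early <- mean(larceny_by_day[setdiff(names(larceny_by_day), late_week)])
mean_late  <- mean(larceny_by_day[late_week])

print(round(100 * late_week_share, 1))  # 47.9
print(c(mean_early, mean_late))         # 1233.25 1511.00
```

Three of the seven days account for roughly 48% of the Larceny/theft incidents, against about 43% if incidents were spread evenly over the week.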

Appendix A

Code

# load packages
library(knitr)    # Dynamic Report Generation (provides kable)
library(stringi)  # Character String Processing Facilities
library(reshape2) # Flexibly Reshape Data
library(ggplot2)  # An Implementation of the Grammar of Graphics
library(ggthemes) # Extra Themes, Scales and Geoms for ggplot2
library(ggmap)    # Spatial Visualization with ggplot2

# Functions
# - Capitalise
capwords <- function(s, strict = FALSE) {
  cap <- function(s) paste(toupper(substring(s, 1, 1)),
                           {s <- substring(s, 2); if (strict) tolower(s) else s},
                           sep      = "", 
                           collapse = " ")
  sapply(strsplit(s, split = " "), cap, USE.NAMES = !is.null(names(s)))
}

# Load San Francisco's data set
sf <- read.csv(file             = "sanfrancisco_incidents_summer_2014.csv",
               stringsAsFactors = FALSE)

# convert Category to a factor (nominal variable)
sf$Category   <- factor(capwords(tolower(sf$Category)))

# reduce Category to a table of frequencies
Data           <- as.data.frame(table(sf$Category), stringsAsFactors = FALSE)
# change columns' names
colnames(Data) <- c("Category", "Frequency")
# order by decreasing frequency
Data           <- Data[order(Data$Frequency, decreasing = TRUE), ]
# force the new order as factor, necessary to force the order in the chart
Data$Category  <- factor(Data$Category, levels = Data$Category)
# filter to the top 15
Data           <- Data[1:15, ]

# create the chart for category 
g <- ggplot(Data)
g <- g + geom_bar(aes(x = Category, y = Frequency, fill = Frequency), stat = "identity")
g <- g + ggtitle(expression(atop("What is Happening", atop(italic("Top 15 Category of Incidents (Summer 2014)"), ""))))
g <- g + scale_fill_continuous(low = "orange", high = "red")
g <- g + theme(axis.text.x = element_text(angle = 45, hjust = 1))
g

# convert Resolution to a factor (nominal variable)
sf$Resolution <- factor(capwords(tolower(sf$Resolution)))

# reduce Resolution to a table of frequencies
Data            <- as.data.frame(table(sf$Resolution), stringsAsFactors = FALSE)
# change columns' names
colnames(Data)  <- c("Resolution", "Frequency")
# order by decreasing frequency
Data            <- Data[order(Data$Frequency, decreasing = TRUE), ]
# force the new order as factor, necessary to force the order in the chart
Data$Resolution <- factor(Data$Resolution, levels = Data$Resolution)
# filter to the top 5
Data            <- Data[1:5, ]

# create the chart for resolution
g <- ggplot(Data)
g <- g + geom_bar(aes(x = Resolution, y = Frequency, fill = Frequency), stat = "identity")
g <- g + ggtitle(expression(atop("What is Happening", atop(italic("Top 5 Resolutions of Incidents (Summer 2014)"), ""))))
g <- g + scale_fill_continuous(low = "orange", high = "red")
g

# convert DayOfWeek to a factor (nominal variable) in calendar order
# and extract the two-digit hour from Time
sf$DayOfWeek <- factor(sf$DayOfWeek, 
                       levels = c("Monday", "Tuesday",  "Wednesday", "Thursday", 
                                  "Friday", "Saturday", "Sunday"))
sf$Hour      <- stri_replace_all_regex(str         = sf$Time,
                                       pattern     = "([0-2][0-9]).*",
                                       replacement = "$1")
sf$Hour      <- factor(sf$Hour)

# reduce DayOfWeek and Hour to a table of frequencies (count rows per combination)
Data            <- dcast(sf, DayOfWeek + Hour ~ ., fun.aggregate = length)
# change columns' names
colnames(Data)  <- c("DayOfWeek", "Hour", "Frequency")

# create the chart for day of week by hour
g <- ggplot(Data)
g <- g + geom_point(aes(x = DayOfWeek, y = Hour, size = Frequency, colour = Frequency), stat = "identity")
g <- g + ggtitle(expression(atop("When it is Happening", atop(italic("Hour by Day of the Week of Incidents (Summer 2014)"), ""))))
g <- g + scale_colour_continuous(low = "yellow", high = "red")
g

# copy the coordinates into descriptively named columns (Y = latitude, X = longitude)
sf$Latitude   <- sf$Y
sf$Longitude  <- sf$X

# filter top 5 relevant categories
sf_sub          <- subset(sf, Category %in% c("Larceny/theft", "Assault", 
                                              "Vehicle Theft", "Warrants", 
                                              "Drug/narcotic"))
sf_sub$Category <- factor(sf_sub$Category)

# reduce Location to a table of frequencies (count rows per combination)
Data            <- dcast(sf_sub, Latitude + Longitude + Category ~ ., fun.aggregate = length)
colnames(Data)  <- c("Latitude", "Longitude", "Category", "Frequency")

# create the chart for Location / Categories
g <- qmplot(Longitude, Latitude, data = Data, color = Category, size = I(1.5),
            maptype = "toner-lite")
g <- g + scale_colour_brewer(type = "div", palette = "Accent")
g <- g + ggtitle(expression(atop("Where it is Happening", atop(italic("Location by Top Categories of Incidents (Summer 2014)"), ""))))
g

kable(t(table(sf_sub$Category)))

# filter top category
sf_sub          <- subset(sf, Category == "Larceny/theft")

# reduce Location to a table of frequencies (count rows per combination)
Data            <- dcast(sf_sub, Latitude + Longitude + DayOfWeek ~ ., fun.aggregate = length)
colnames(Data)  <- c("Latitude", "Longitude", "DayOfWeek", "Frequency")

# create the chart for Location / Day of Week
g <- qmplot(Longitude, Latitude, data = Data, 
            color = DayOfWeek, size = I(1.5), maptype = "toner-lite")
g <- g + scale_colour_brewer(type = "div", palette = "BrBG")
g <- g + ggtitle(expression(atop("Where it is Happening: Larceny/Theft", atop(italic("Location by Day of Week of Incidents (Summer 2014)"), ""))))
g

kable(t(table(sf_sub$DayOfWeek)))

Session Information

## R version 3.2.2 (2015-08-14)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.11.1 (El Capitan)
## 
## locale:
## [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] mapproj_1.2-4  maps_3.0.0-2   ggmap_2.5.2    ggthemes_2.2.1
## [5] ggplot2_1.0.1  reshape2_1.4.1 stringi_1.0-1  knitr_1.11    
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.2         magrittr_1.5        MASS_7.3-45        
##  [4] munsell_0.4.2       lattice_0.20-33     geosphere_1.4-3    
##  [7] colorspace_1.2-6    rjson_0.2.15        jpeg_0.1-8         
## [10] highr_0.5.1         stringr_1.0.0       plyr_1.8.3         
## [13] tools_3.2.2         grid_3.2.2          gtable_0.1.2       
## [16] png_0.1-7           htmltools_0.2.6     yaml_2.1.13        
## [19] digest_0.6.8        RJSONIO_1.3-0       RColorBrewer_1.1-2 
## [22] formatR_1.2.1       evaluate_0.8        rmarkdown_0.8.1    
## [25] labeling_0.3        sp_1.2-1            RgoogleMaps_1.2.0.7
## [28] scales_0.3.0        proto_0.3-10

Crime Analytics

Visualisation of Incident Reports

Angelo Klin

29 November 2015
