Introduction and Overview

In this assignment, we analyze criminal incident data from San Francisco to visualize patterns, and then take a more in-depth look at Reported Assault data.

Setting up the R Environment

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.1.3
library(stringi)
## Warning: package 'stringi' was built under R version 3.1.3
library(reshape2) 
library(ggthemes) 
## Warning: package 'ggthemes' was built under R version 3.1.3
require(ggmap)   
## Loading required package: ggmap
## Warning: package 'ggmap' was built under R version 3.1.3
library(chron)
## Warning: package 'chron' was built under R version 3.1.3

Obtaining the data - Download the data and load/manipulate it in R

Task 1 - Data acquisition and cleaning

The goal of this task is to get familiar with the dataset and do the necessary cleaning. After this exercise, we should understand what real data looks like and how much effort goes into cleaning it.

Open File

setwd("C:/Users/aj282_000/OneDrive/Coursera/Data_at_Scale - Univ of Wash/Communicating Data Science Results")
SF_CrimeData <- read.csv('SFCrimeData.csv', na.strings=c("NA","#DIV/0!",""))
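
To show what the na.strings argument does during loading, here is a minimal sketch on a made-up three-row extract (the values and the reduced column set are assumptions for illustration, not rows from SFCrimeData.csv). Each string listed in na.strings is converted to NA on read, which also lets a numeric column stay numeric despite a spreadsheet error marker.

```r
# Hypothetical three-row extract; the values are made up for illustration.
csv_text <- "Category,Resolution,X
ASSAULT,NONE,-122.41
LARCENY/THEFT,,#DIV/0!
NA,ARREST,-122.43"

# Treat "NA", "#DIV/0!" and empty fields as missing values.
toy <- read.csv(text = csv_text, na.strings = c("NA", "#DIV/0!", ""))

# Count missing values per column after the conversion.
colSums(is.na(toy))
```

Running colSums(is.na(...)) on the full data frame is a quick way to check how many values were affected before cleaning.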

Review Data

Before cleaning up and structuring the data, we will first take a look at the structure of the data.

dim(SF_CrimeData)
## [1] 28993    13
summary(SF_CrimeData)
##    IncidntNum                  Category   
##  Min.   : 10284385   LARCENY/THEFT :9466  
##  1st Qu.:140545607   OTHER OFFENSES:3567  
##  Median :140632022   NON-CRIMINAL  :3023  
##  Mean   :142017280   ASSAULT       :2882  
##  3rd Qu.:140719664   VEHICLE THEFT :1966  
##  Max.   :990367398   WARRANTS      :1782  
##                      (Other)       :6307  
##                          Descript         DayOfWeek            Date      
##  GRAND THEFT FROM LOCKED AUTO: 3766   Friday   :4451   06/28/2014:  410  
##  STOLEN AUTOMOBILE           : 1350   Monday   :4005   08/09/2014:  410  
##  LOST PROPERTY               : 1202   Saturday :4319   08/08/2014:  403  
##  PETTY THEFT OF PROPERTY     : 1125   Sunday   :4218   06/29/2014:  397  
##  WARRANT ARREST              :  980   Thursday :3968   08/29/2014:  388  
##  PETTY THEFT FROM LOCKED AUTO:  955   Tuesday  :3930   06/04/2014:  380  
##  (Other)                     :19615   Wednesday:4102   (Other)   :26605  
##       Time           PdDistrict             Resolution   
##  12:00  :  784   SOUTHERN :5739   NONE           :19139  
##  00:01  :  661   MISSION  :3700   ARREST, BOOKED : 6502  
##  18:00  :  649   NORTHERN :3589   ARREST, CITED  : 1419  
##  19:00  :  621   CENTRAL  :3513   LOCATED        : 1042  
##  17:00  :  594   BAYVIEW  :2725   UNFOUNDED      :  260  
##  20:00  :  586   INGLESIDE:2378   JUVENILE BOOKED:  163  
##  (Other):25098   (Other)  :7349   (Other)        :  468  
##                      Address            X                Y        
##  800 Block of BRYANT ST  :  948   Min.   :-122.5   Min.   :37.71  
##  800 Block of MARKET ST  :  288   1st Qu.:-122.4   1st Qu.:37.76  
##  900 Block of POTRERO AV :  230   Median :-122.4   Median :37.78  
##  1000 Block of POTRERO AV:  199   Mean   :-122.4   Mean   :37.77  
##  2000 Block of MISSION ST:  149   3rd Qu.:-122.4   3rd Qu.:37.79  
##  16TH ST / MISSION ST    :  116   Max.   :-122.4   Max.   :37.82  
##  (Other)                 :27063                                   
##                                   Location          PdId          
##  (37.775420706711, -122.403404791479) :  940   Min.   :1.028e+12  
##  (37.7571580431915, -122.406604919508):  224   1st Qu.:1.405e+13  
##  (37.7564864109309, -122.406539115148):  196   Median :1.406e+13  
##  (37.7650501214668, -122.419671780296):  152   Mean   :1.420e+13  
##  (37.7841893501425, -122.407633520742):  150   3rd Qu.:1.407e+13  
##  (37.7285280627465, -122.475647460786):  102   Max.   :9.904e+13  
##  (Other)                              :27229
names(SF_CrimeData)
##  [1] "IncidntNum" "Category"   "Descript"   "DayOfWeek"  "Date"      
##  [6] "Time"       "PdDistrict" "Resolution" "Address"    "X"         
## [11] "Y"          "Location"   "PdId"
head(SF_CrimeData)
##   IncidntNum      Category                     Descript DayOfWeek
## 1  140734311         ARSON           ARSON OF A VEHICLE    Sunday
## 2  140736317  NON-CRIMINAL                LOST PROPERTY    Sunday
## 3  146177923 LARCENY/THEFT GRAND THEFT FROM LOCKED AUTO    Sunday
## 4  146177531 LARCENY/THEFT GRAND THEFT FROM LOCKED AUTO    Sunday
## 5  140734220  NON-CRIMINAL               FOUND PROPERTY    Sunday
## 6  140734349 DRUG/NARCOTIC      POSSESSION OF MARIJUANA    Sunday
##         Date  Time PdDistrict     Resolution                   Address
## 1 08/31/2014 23:50    BAYVIEW           NONE LOOMIS ST / INDUSTRIAL ST
## 2 08/31/2014 23:45    MISSION           NONE    400 Block of CASTRO ST
## 3 08/31/2014 23:30   SOUTHERN           NONE  1000 Block of MISSION ST
## 4 08/31/2014 23:30   RICHMOND           NONE       FULTON ST / 26TH AV
## 5 08/31/2014 23:23   RICHMOND           NONE  800 Block of LA PLAYA ST
## 6 08/31/2014 23:13   SOUTHERN ARREST, BOOKED        11TH ST / MINNA ST
##           X        Y                              Location         PdId
## 1 -122.4056 37.73832 (37.7383221869053, -122.405646994567) 1.407343e+13
## 2 -122.4350 37.76177 (37.7617677182954, -122.435012093789) 1.407363e+13
## 3 -122.4098 37.78004 (37.7800356268394, -122.409795194505) 1.461779e+13
## 4 -122.4853 37.77252 (37.7725176473142, -122.485262988324) 1.461775e+13
## 5 -122.5099 37.77231 (37.7723131976814, -122.509895418239) 1.407342e+13
## 6 -122.4166 37.77391  (37.773907074489, -122.416578493475) 1.407343e+13

Clean and Structure Data

In this step, we will clean up the data and prepare it for visualization and analysis.

SF_CrimeData$Date <- as.Date(SF_CrimeData$Date, "%m/%d/%Y")
breaks <- c(0, 6,  12,  18, 20, 24) / 24 
labels <- c("midnight","morning", "afternoon", "evening","night")
timeList <- times(paste0(SF_CrimeData$Time, ":00"))
SF_CrimeData$timeOfDay <- cut(timeList, breaks, labels, include.lowest = TRUE)
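
The binning step above can be illustrated without chron. The sketch below is a base R approximation, assuming times arrive as "HH:MM" strings; the sample times are made up. It converts each time to a fraction of a day and applies the same breaks and labels. One thing it makes visible: because cut() closes intervals on the right by default, a time that falls exactly on a break (such as 12:00) lands in the earlier bin, so right = FALSE may be worth considering if noon should count as afternoon.

```r
# Illustrative times only; not taken from the dataset.
time_strings <- c("00:01", "05:59", "06:00", "12:00",
                  "18:00", "19:59", "20:00", "23:50")

# Convert "HH:MM" to a fraction of a day.
parts <- strsplit(time_strings, ":", fixed = TRUE)
day_fraction <- sapply(parts, function(p) {
  (as.numeric(p[1]) * 60 + as.numeric(p[2])) / (24 * 60)
})

# Same breaks and labels as in the cleaning step.
breaks <- c(0, 6, 12, 18, 20, 24) / 24
labels <- c("midnight", "morning", "afternoon", "evening", "night")

# Default right = TRUE: each bin includes its upper boundary.
time_of_day <- cut(day_fraction, breaks, labels, include.lowest = TRUE)
data.frame(time = time_strings, time_of_day)
```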

#Top 10 Crimes by Category
SF_CrimeData$Category <- as.factor(SF_CrimeData$Category)
SF_CD_CategoryData <- as.data.frame(table(SF_CrimeData$Category), stringsAsFactors=FALSE)
colnames(SF_CD_CategoryData) <- c("Category", "Frequency")
SF_CD_CategoryData <- SF_CD_CategoryData[order(SF_CD_CategoryData$Frequency, decreasing= TRUE), ]
SF_CD_CategoryData_Top10 <- SF_CD_CategoryData[1:10, ]
top10Crimes <- SF_CD_CategoryData_Top10$Category

#Top 10 Crimes by Category and Other Variables
SF_CD_CategoryData_Full <- as.data.frame(table(SF_CrimeData$Category, SF_CrimeData$timeOfDay, SF_CrimeData$DayOfWeek, SF_CrimeData$PdDistrict, SF_CrimeData$Resolution), stringsAsFactors=FALSE)
colnames(SF_CD_CategoryData_Full) <- c("Category", "TimeOfDay", "DayOfWeek", "PdDistrict", "Resolution", "Frequency")
SF_CD_CategoryData_Full_Top10 <- SF_CD_CategoryData_Full[SF_CD_CategoryData_Full$Category %in% top10Crimes,]

#Top 10 Crimes by Category and Time of Day
SF_CD_CategoryData_TOD <- as.data.frame(table(SF_CrimeData$Category, SF_CrimeData$timeOfDay), stringsAsFactors=FALSE)
colnames(SF_CD_CategoryData_TOD) <- c("Category", "TimeOfDay", "Frequency")
SF_CD_CategoryData_TOD_Top10 <- SF_CD_CategoryData_TOD[SF_CD_CategoryData_TOD$Category %in% top10Crimes,]


#Top 10 Crimes by Category and Day of Week

SF_CD_CategoryData_DOW <- as.data.frame(table(SF_CrimeData$Category, SF_CrimeData$DayOfWeek), stringsAsFactors=FALSE)
colnames(SF_CD_CategoryData_DOW) <- c("Category", "DayOfWeek", "Frequency")
SF_CD_CategoryData_DOW_Top10 <- SF_CD_CategoryData_DOW[SF_CD_CategoryData_DOW$Category %in% top10Crimes,]

#Assault by Day of Week and Time of Day
SF_AssaultData <- subset(SF_CrimeData, Category == "ASSAULT")
SF_AssaultData_TOD_DOW <- as.data.frame(table(SF_AssaultData$timeOfDay, SF_AssaultData$DayOfWeek ), stringsAsFactors=FALSE)
colnames(SF_AssaultData_TOD_DOW) <- c("TimeOfDay", "DayOfWeek", "Frequency")

#Assault Data by Location

SF_AssaultData$Longitude <- SF_AssaultData$X
SF_AssaultData$Latitude <- SF_AssaultData$Y
SF_AD_Location <- dcast(SF_AssaultData, Latitude + Longitude ~ .)
## Using Latitude as value column: use value.var to override.
## Aggregation function missing: defaulting to length
colnames(SF_AD_Location) <- c("Latitude", "Longitude", "Frequency")
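
The two messages printed by dcast indicate that it picked a value column on its own and fell back to counting rows per group; dcast also accepts fun.aggregate and value.var arguments to make that choice explicit. As a sketch of what the count amounts to, the same per-location frequency can be computed in base R. The coordinates below are made up for illustration.

```r
# Made-up coordinates for illustration; two incidents share a location.
toy_assaults <- data.frame(
  Latitude  = c(37.7754, 37.7754, 37.7572, 37.7842),
  Longitude = c(-122.4034, -122.4034, -122.4066, -122.4076)
)

# Add a constant column and sum it per location, i.e. count rows per group.
toy_assaults$Frequency <- 1L
location_counts <- aggregate(Frequency ~ Latitude + Longitude,
                             data = toy_assaults, FUN = sum)
location_counts
```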

#Assault Data by Location And Day of the Week

SF_AD_Location_DOW <- dcast(SF_AssaultData, Latitude + Longitude + DayOfWeek ~ .)
## Using Latitude as value column: use value.var to override.
## Aggregation function missing: defaulting to length
colnames(SF_AD_Location_DOW) <- c("Latitude", "Longitude", "DayOfWeek", "Frequency")

Task 2 - Visualization of Crime Data for San Francisco (Summer 2014)

First, we look at the Top 10 Crimes by Frequency in San Francisco.

positions <- SF_CD_CategoryData_Top10$Category
ggplot(SF_CD_CategoryData_Top10) + geom_bar(aes(x = Category, y = Frequency, fill = Frequency), stat = "identity") + scale_x_discrete(limits = positions) + ggtitle("Top 10 Categories of Crimes for San Francisco (Summer 2014)") + xlab("Category of Crime") + ylab("Frequency of Crime") + theme(plot.title = element_text(size = rel(1.2))) + theme(axis.text.x = element_text(angle = 75, hjust = 1))

Next, we look at the Top 10 Crimes by Frequency and Time of Day.

timePositions <- c("morning", "afternoon", "evening","night","midnight")
positions <- SF_CD_CategoryData_Top10$Category
ggplot(SF_CD_CategoryData_TOD_Top10, aes(Category, TimeOfDay)) + geom_tile(aes(fill=Frequency)) + scale_x_discrete(limits = positions) + scale_y_discrete(limits = timePositions) + ggtitle("Top 10 Crimes by Time of Day in San Francisco (Summer 2014)") + xlab("Category of Crime") + ylab("Time of Day") + theme(plot.title = element_text(size = rel(1.2))) + theme(axis.text.x = element_text(angle = 75, hjust = 1))

Then we look at the Top 10 Crimes by Frequency and Day of Week.

dayPositions <- c("Monday", "Tuesday", "Wednesday","Thursday","Friday", "Saturday", "Sunday")
positions <- SF_CD_CategoryData_Top10$Category
ggplot(SF_CD_CategoryData_DOW_Top10, aes(Category, DayOfWeek)) + geom_tile(aes(fill=Frequency)) + scale_x_discrete(limits = positions) + scale_y_discrete(limits = dayPositions) + ggtitle("Top 10 Crimes by Day of Week in San Francisco (Summer 2014)") + xlab("Category of Crime") + ylab("Day of Week") + theme(plot.title = element_text(size = rel(1.2))) + theme(axis.text.x = element_text(angle = 75, hjust = 1))

Interestingly, across the Top 10 Crimes the highest frequency occurs in the afternoon, with fewer incidents after midnight and into the morning. Assault, however, appears to be relatively more frequent in the evening and midnight time frames than other crimes. Possible explanations include that this is when people are home and that people tend to drink more alcohol at night.
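
A comparison "relative to other crimes" is easier to judge on shares than on raw counts, since a high-volume category dominates a raw-count heatmap. The sketch below, using made-up counts (not the San Francisco figures), converts a category-by-time table to within-category shares with prop.table so that each row sums to 1.

```r
# Made-up counts for illustration; these are not the San Francisco figures.
toy_counts <- matrix(
  c(50, 200, 400, 150, 200,   # LARCENY/THEFT
    40,  30,  50,  30,  50),  # ASSAULT
  nrow = 2, byrow = TRUE,
  dimnames = list(
    Category  = c("LARCENY/THEFT", "ASSAULT"),
    TimeOfDay = c("midnight", "morning", "afternoon", "evening", "night")
  )
)

# Divide each row by its total so every category sums to 1.
shares <- prop.table(toy_counts, margin = 1)
round(shares, 2)
```

In this toy table, assault has far fewer incidents overall but a larger share of them in the midnight bin, which is the kind of pattern raw counts can hide.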

Taking a further look at reported Assault

First, we look at a heatmap of Reported Assaults by Day of Week and Time of Day. It shows a clear pattern of higher frequency during the afternoon and, interestingly, higher frequency on Monday and Friday than on other days of the week. Also unsurprisingly, the frequency of Assaults is higher after midnight on Saturday (Friday night) and Sunday (Saturday night).

dayPositions <- c("Monday", "Tuesday", "Wednesday","Thursday","Friday", "Saturday", "Sunday")
timePositions <- c("morning", "afternoon", "evening","night","midnight")
ggplot(SF_AssaultData_TOD_DOW, aes(DayOfWeek, TimeOfDay)) + geom_tile(aes(fill=Frequency)) + scale_x_discrete(limits = dayPositions) + scale_y_discrete(limits = timePositions) + ggtitle("Reported Assault by Time of Day and Day of Week in San Francisco (Summer 2014)") + xlab("Day of Week") + ylab("Time of Day") + theme(plot.title = element_text(size = rel(1.2))) + theme(axis.text.x = element_text(angle = 75, hjust = 1))
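
The plots here order the axes by passing limits to the discrete scales. An alternative, sketched below on made-up rows, is to store the day as a factor with explicit levels, so that tables and plots built from the column follow calendar order without repeating the ordering in every plot. This is offered as an option, not as the approach used in this report.

```r
# Made-up rows for illustration.
toy <- data.frame(
  DayOfWeek = c("Sunday", "Monday", "Friday", "Monday", "Saturday"),
  stringsAsFactors = FALSE
)

# Calendar order, Monday first, matching dayPositions in the report.
day_levels <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                "Friday", "Saturday", "Sunday")
toy$DayOfWeek <- factor(toy$DayOfWeek, levels = day_levels)

# table() now lists days in calendar order and keeps days with zero incidents.
table(toy$DayOfWeek)
```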

Second, we look at the location of Reported Assaults by Frequency. Interestingly, the number of Reported Assaults is quite high around Chinatown, Union Square, and the Financial District, areas that tourists frequent. In terms of encouraging tourism, this is not an especially good trend.

qmplot(Longitude, Latitude, data = SF_AD_Location, size = Frequency, maptype = "toner-lite") + scale_colour_brewer(type = "div", palette = "Accent") + ggtitle(expression(atop(bold("Reported Assault in San Francisco by Location of Incidents (Summer 2014)"), ""))) + theme(plot.title = element_text(size = rel(.9)))
## Using zoom = 13...
## Map from URL : http://tile.stamen.com/toner-lite/13/1308/3165.png
## Map from URL : http://tile.stamen.com/toner-lite/13/1309/3165.png
## Map from URL : http://tile.stamen.com/toner-lite/13/1310/3165.png
## Map from URL : http://tile.stamen.com/toner-lite/13/1311/3165.png
## Map from URL : http://tile.stamen.com/toner-lite/13/1308/3166.png
## Map from URL : http://tile.stamen.com/toner-lite/13/1309/3166.png
## Map from URL : http://tile.stamen.com/toner-lite/13/1310/3166.png
## Map from URL : http://tile.stamen.com/toner-lite/13/1311/3166.png
## Map from URL : http://tile.stamen.com/toner-lite/13/1308/3167.png
## Map from URL : http://tile.stamen.com/toner-lite/13/1309/3167.png
## Map from URL : http://tile.stamen.com/toner-lite/13/1310/3167.png
## Map from URL : http://tile.stamen.com/toner-lite/13/1311/3167.png
## Map from URL : http://tile.stamen.com/toner-lite/13/1308/3168.png
## Map from URL : http://tile.stamen.com/toner-lite/13/1309/3168.png
## Map from URL : http://tile.stamen.com/toner-lite/13/1310/3168.png
## Map from URL : http://tile.stamen.com/toner-lite/13/1311/3168.png

Last, we look again at Reported Assaults by Location, this time also broken down by Day of the Week.

qmplot(Longitude, Latitude, data = SF_AD_Location_DOW, color=DayOfWeek, size = Frequency, maptype = "toner-lite") + scale_colour_brewer(type = "div", palette = "Accent") + ggtitle(expression(atop(bold("Reported Assault in San Francisco by Location and Day of Week of Incidents (Summer 2014)"), ""))) + theme(plot.title = element_text(size = rel(.8)))
## Using zoom = 13...

Conclusion

The San Francisco Crime Data provides a rich level of detail that can give significant insight into trends in reported crimes. Differences in the frequency of reported crimes by day of the week and time of day can inform police staffing requirements. Differences in the frequency of reported crimes by type and location can indicate where police might want to increase attention and patrolling. The frequency of Reported Assaults by location also highlights what should be a concerning trend: the level of Reported Assaults occurring in locations frequented by tourists.