Project Description
Analysis of criminal incident data from San Francisco to visualize patterns.
Produce a blog-post-style visual narrative: a series of visualizations interspersed with enough descriptive text to make a convincing, reproducible argument.
The analysis is done in R, and the report is written as an R Markdown file that produces an HTML file.
Data Description
In this analysis, we use real crime data from San Francisco for summer 2014, covering June, July and August. The data set contains 28993 observations, each corresponding to one reported incident. Each observation has 13 variables, among them the date and time of occurrence, the crime category and the police district.
Getting the data
The original data come from the San Francisco Data Portal (https://data.sfgov.org/). For the analysis we use a subset of the data available in the course GitHub repository.
#Working directory specification
setwd("~/coursera/Data Science at Scale UW/Communicating Data Science Results/ass6")
#San Francisco data (raw file URL: the github.com "blob" page URL returns an HTML page, not the CSV)
filesf <- "https://raw.githubusercontent.com/uwescience/datasci_course_materials/master/assignment6/sanfrancisco_incidents_summer_2014.csv"
#Uncomment the next line to download the file once
#download.file(filesf, destfile = "./sanfrancisco_incidents_summer_2014.csv", method = "auto")
sf <- read.csv("./sanfrancisco_incidents_summer_2014.csv", stringsAsFactors = FALSE)
#structure of the data
str(sf)
## 'data.frame': 28993 obs. of 13 variables:
## $ IncidntNum: int 140734311 140736317 146177923 146177531 140734220 140734349 140734349 140734349 140738147 140734258 ...
## $ Category : chr "ARSON" "NON-CRIMINAL" "LARCENY/THEFT" "LARCENY/THEFT" ...
## $ Descript : chr "ARSON OF A VEHICLE" "LOST PROPERTY" "GRAND THEFT FROM LOCKED AUTO" "GRAND THEFT FROM LOCKED AUTO" ...
## $ DayOfWeek : chr "Sunday" "Sunday" "Sunday" "Sunday" ...
## $ Date : chr "08/31/2014" "08/31/2014" "08/31/2014" "08/31/2014" ...
## $ Time : chr "23:50" "23:45" "23:30" "23:30" ...
## $ PdDistrict: chr "BAYVIEW" "MISSION" "SOUTHERN" "RICHMOND" ...
## $ Resolution: chr "NONE" "NONE" "NONE" "NONE" ...
## $ Address : chr "LOOMIS ST / INDUSTRIAL ST" "400 Block of CASTRO ST" "1000 Block of MISSION ST" "FULTON ST / 26TH AV" ...
## $ X : num -122 -122 -122 -122 -123 ...
## $ Y : num 37.7 37.8 37.8 37.8 37.8 ...
## $ Location : chr "(37.7383221869053, -122.405646994567)" "(37.7617677182954, -122.435012093789)" "(37.7800356268394, -122.409795194505)" "(37.7725176473142, -122.485262988324)" ...
## $ PdId : num 1.41e+13 1.41e+13 1.46e+13 1.46e+13 1.41e+13 ...
#Creation of a DateTime variable that combines the date and time of the crime
sf$DateTime <- as.POSIXct(strptime(paste(sf$Date, sf$Time), "%m/%d/%Y %H:%M"))
#The Date variable is converted to Date format
sf$Date <- as.Date(sf$Date, format = "%m/%d/%Y")
#Each crime is assigned to a period of the day based on its hour
sf$PeriodOfDay <- ifelse(as.POSIXlt(sf$DateTime)$hour < 6, "Night",
                  ifelse(as.POSIXlt(sf$DateTime)$hour < 12, "AM",
                  ifelse(as.POSIXlt(sf$DateTime)$hour < 18, "PM",
                  ifelse(as.POSIXlt(sf$DateTime)$hour < 24, "Evening", "NA"))))
#Loading our favorite R package
library(data.table)
## Warning: package 'data.table' was built under R version 3.2.3
#The data frame is converted to a data.table
sfDT<-as.data.table(sf)
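The period-of-day step above can also be written with `cut()`, which makes the interval boundaries explicit. The following is only a sketch on a few made-up timestamps, not the report's data. The `toy` data frame and the fixed `tz = "UTC"` are assumptions introduced here so that the result does not depend on the local time zone.

```r
# Toy timestamps (made up for illustration, not taken from the data set)
toy <- data.frame(
  Date = c("08/31/2014", "08/31/2014", "07/04/2014", "06/01/2014"),
  Time = c("23:50", "05:59", "12:00", "06:00"),
  stringsAsFactors = FALSE
)

# Same date-time format as in the report; tz = "UTC" is an assumption added here
toy$DateTime <- as.POSIXct(paste(toy$Date, toy$Time),
                           format = "%m/%d/%Y %H:%M", tz = "UTC")

# Extract the hour (0-23) and bin it into four 6-hour periods.
# right = FALSE gives the intervals [0,6), [6,12), [12,18), [18,24)
hour_of_day <- as.POSIXlt(toy$DateTime)$hour
toy$PeriodOfDay <- as.character(cut(hour_of_day,
                                    breaks = c(0, 6, 12, 18, 24),
                                    labels = c("Night", "AM", "PM", "Evening"),
                                    right = FALSE))

print(toy[, c("Time", "PeriodOfDay")])
```

Because the four periods all span six hours, their counts can be compared directly without normalizing by period length.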
Exploratory analysis
We want to know how incidents vary during the day. To answer this, we calculate the total number of crimes for each hour of the day in San Francisco during summer 2014 and plot the results.
# how do incidents vary by time of day?
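#The grouping column is named explicitly instead of relying on data.table's automatic name
Freq_hour <- sfDT[, .N, by = .(Hour = as.POSIXlt(DateTime)$hour)][order(Hour)]
barplot(height = Freq_hour$N, names.arg = Freq_hour$Hour, ylim = c(0, 2000),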
main="San Francisco Number of crimes by time of day\n Summer 2014",
ylab="Number of crimes",
xlab="Hour of Day", las=1)
The plot shows that the total number of crimes varies by hour of the day. It is lowest at the end of the night, around 04:00, then increases steadily through the day until the early evening (17:00-18:00), where it reaches a maximum. It then decreases slowly during the evening and more rapidly during the night. Interestingly, there is also a local peak at noon.
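One detail worth keeping in mind with hourly counts: a grouped count only returns the hours that actually occur in the data, so an hour with no incident would silently disappear from the bar plot. A minimal base R sketch, using made-up hours rather than the report's data, shows one way to keep all 24 bars. The `toy_hours` vector is hypothetical.

```r
# Toy hours of occurrence (made up for illustration)
toy_hours <- c(0, 0, 4, 12, 12, 12, 17, 18, 18, 23)

# Declaring all 24 levels keeps hours without any incident as zero counts
counts <- table(factor(toy_hours, levels = 0:23))

print(counts)
```

Passing `counts` to `barplot()` would then draw one bar per hour, including the empty ones.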
#We summarize the data for the period of the day
sfDT[,.N,by=PeriodOfDay]
## PeriodOfDay N
## 1: Evening 9810
## 2: PM 9859
## 3: AM 5518
## 4: Night 3806
In San Francisco during summer 2014, the night period has the smallest total number of incidents. Most crimes are committed during the afternoon (PM) and the evening.
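To put these counts in proportion, the table above can be converted to percentages of the 28993 incidents. The sketch below simply retypes the four counts from the output. The variable names are introduced here for illustration.

```r
# Counts copied from the PeriodOfDay table above
period_counts <- c(Evening = 9810, PM = 9859, AM = 5518, Night = 3806)

# The four periods should add up to the full data set
total <- sum(period_counts)

# Share of each period, in percent, rounded to one decimal
shares <- round(100 * period_counts / total, 1)
print(shares)
```

PM and Evening each account for about a third of all incidents, while Night accounts for roughly 13%.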
#Which incidents are most common in the evening?
head(sfDT[PeriodOfDay == "Evening",.N,by=Category][order(-N)],n=10)
## Category N
## 1: LARCENY/THEFT 3835
## 2: OTHER OFFENSES 1029
## 3: VEHICLE THEFT 921
## 4: ASSAULT 909
## 5: NON-CRIMINAL 778
## 6: WARRANTS 543
## 7: DRUG/NARCOTIC 379
## 8: SUSPICIOUS OCC 372
## 9: MISSING PERSON 286
## 10: SECONDARY CODES 135
The table above lists the 10 most frequently reported incident categories in the evening in San Francisco during summer 2014.
We now want to know during which period of the day robberies are most frequent.
#Total number of "ROBBERY" in San Francisco, during 2014 summer
sfDT[Category=="ROBBERY",.N,by=Category]
## Category N
## 1: ROBBERY 308
A total of 308 robberies were reported in San Francisco during summer 2014. To see how this number varies during the day, we calculate the total number of robberies for each hour of the day and plot the results.
#During what periods of the day are robberies most common?
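#The grouping column is named explicitly instead of relying on data.table's automatic name
Freq_hour_robbery <- sfDT[Category == "ROBBERY", .N, by = .(Hour = as.POSIXlt(DateTime)$hour)][order(Hour)]
barplot(height = Freq_hour_robbery$N, names.arg = Freq_hour_robbery$Hour, ylim = c(0, 25),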
main="San Francisco Number of robberies by time of day\n Summer 2014",
ylab="Number of robberies",
xlab="Hour of Day")
The total number of robberies is relatively constant from the middle of the night (03:00) to the end of the afternoon (16:00), except for a marked peak at noon. As the evening goes on, the number of reported robberies increases and reaches a maximum around midnight (22:00 to 02:00) before diminishing.
Over the day, the total number of robberies is highest in the evening.
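A tally of robberies by period of day is one way to check a statement like this. The sketch below shows the idea in base R on a handful of made-up rows. The `toy` data frame is hypothetical and its counts say nothing about the real data.

```r
# Toy incidents (made up for illustration, not taken from the data set)
toy <- data.frame(
  Category    = c("ROBBERY", "ROBBERY", "ASSAULT", "ROBBERY", "LARCENY/THEFT", "ROBBERY"),
  PeriodOfDay = c("Evening", "Night", "Evening", "Evening", "PM", "AM"),
  stringsAsFactors = FALSE
)

# Keep only robberies, then count them per period, most frequent first
robberies <- toy[toy$Category == "ROBBERY", ]
by_period <- sort(table(robberies$PeriodOfDay), decreasing = TRUE)

print(by_period)
```

On the report's data.table, the analogous call would presumably be `sfDT[Category == "ROBBERY", .N, by = PeriodOfDay][order(-N)]`.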
We would like to know how the number of incidents varies across the city.
#how do incidents vary by neighborhood?
sfDT[,.N,by=PdDistrict][order(-N)]
## PdDistrict N
## 1: SOUTHERN 5739
## 2: MISSION 3700
## 3: NORTHERN 3589
## 4: CENTRAL 3513
## 5: BAYVIEW 2725
## 6: INGLESIDE 2378
## 7: TENDERLOIN 2257
## 8: TARAVAL 1853
## 9: PARK 1693
## 10: RICHMOND 1546
The table lists the San Francisco police districts with their total number of reported crimes for summer 2014. The Southern district has by far the highest total number of reported crimes.
To see which incidents are most common in the city center, we filter the data for the “CENTRAL” district and calculate the total number of incidents for each crime category during summer 2014 (top 10 shown).
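The gap between the leading district and the rest can be quantified directly from the table above. The sketch below retypes the ten district counts from the output. The variable names are introduced here for illustration.

```r
# Counts copied from the PdDistrict table above
district_counts <- c(SOUTHERN = 5739, MISSION = 3700, NORTHERN = 3589,
                     CENTRAL = 3513, BAYVIEW = 2725, INGLESIDE = 2378,
                     TENDERLOIN = 2257, TARAVAL = 1853, PARK = 1693,
                     RICHMOND = 1546)

# The ten districts should add up to the full data set
total <- sum(district_counts)

# Share of SOUTHERN in all incidents, in percent
southern_share <- round(100 * district_counts[["SOUTHERN"]] / total, 1)

# How many times more incidents SOUTHERN has than the second district
ratio_to_second <- round(district_counts[["SOUTHERN"]] / district_counts[["MISSION"]], 2)

print(c(share_percent = southern_share, ratio_to_mission = ratio_to_second))
```

SOUTHERN alone accounts for about a fifth of all reported incidents and has roughly one and a half times as many as MISSION. Raw counts do not account for differences in district size or population.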
#Which incidents are most common in the city center?
head(sfDT[PdDistrict=="CENTRAL",.N,by=Category][order(-N)],n=10)
## Category N
## 1: LARCENY/THEFT 1574
## 2: NON-CRIMINAL 431
## 3: OTHER OFFENSES 326
## 4: ASSAULT 277
## 5: WARRANTS 162
## 6: VEHICLE THEFT 146
## 7: SUSPICIOUS OCC 131
## 8: DRUG/NARCOTIC 91
## 9: MISSING PERSON 86
## 10: PROSTITUTION 67
Finally, to identify which districts have the most robberies or thefts, we select the incidents whose category is “LARCENY/THEFT”, “VEHICLE THEFT” or “ROBBERY”, calculate the total number of incidents for each district and plot the result.
#In what areas or neighborhoods are robberies or thefts most common?
Freq_District_robbery=sfDT[Category %in% c("LARCENY/THEFT","VEHICLE THEFT","ROBBERY"),.N,by=PdDistrict][order(-N)]
barplot(height = Freq_District_robbery$N, names.arg = Freq_District_robbery$PdDistrict, las = 2, ylim = c(0, 2500),
main="San Francisco Number of robberies or thefts by district\n Summer 2014",
ylab="Number of robberies or thefts",
xlab="SF District")
As expected, these crimes are most common in the Southern district.