In this assignment, we analyze criminal incident data from San Francisco to visualize patterns, and then take a more in-depth look at reported assault data.
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.1.3
library(stringi)
## Warning: package 'stringi' was built under R version 3.1.3
library(reshape2)
library(ggthemes)
## Warning: package 'ggthemes' was built under R version 3.1.3
require(ggmap)
## Loading required package: ggmap
## Warning: package 'ggmap' was built under R version 3.1.3
library(chron)
## Warning: package 'chron' was built under R version 3.1.3
Task 1 - Data acquisition and cleaning
The goal of this task is to get familiar with the dataset and do the necessary cleaning. After this exercise, we should understand what real data looks like and how much effort goes into cleaning it.
Open File
setwd("C:/Users/aj282_000/OneDrive/Coursera/Data_at_Scale - Univ of Wash/Communicating Data Science Results")
SF_CrimeData <- read.csv('SFCrimeData.csv', na.strings=c("NA","#DIV/0!",""))
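The `na.strings` argument above decides which raw values are read as missing. As a minimal sketch on made-up data (the three-row CSV text and the name `toy` are illustrative assumptions, not rows from SFCrimeData.csv), its effect can be checked like this:

```r
# Hypothetical CSV text, used only to illustrate na.strings
csv_text <- "Category,Resolution,X\nASSAULT,,-122.4\nARSON,NA,#DIV/0!\nWARRANTS,NONE,-122.5"

# Same na.strings as in the read.csv call above
toy <- read.csv(text = csv_text, na.strings = c("NA", "#DIV/0!", ""))

# Count missing values per column; note that the literal string "NONE"
# is NOT treated as missing, it is a regular Resolution value
na_counts <- colSums(is.na(toy))
print(na_counts)
```

Running the same `colSums(is.na(...))` check on `SF_CrimeData` would show whether any column actually contains missing values before the cleaning step.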
Review Data
Before cleaning up and structuring the data, we will first take a look at the structure of the data.
dim(SF_CrimeData)
## [1] 28993 13
summary(SF_CrimeData)
## IncidntNum Category
## Min. : 10284385 LARCENY/THEFT :9466
## 1st Qu.:140545607 OTHER OFFENSES:3567
## Median :140632022 NON-CRIMINAL :3023
## Mean :142017280 ASSAULT :2882
## 3rd Qu.:140719664 VEHICLE THEFT :1966
## Max. :990367398 WARRANTS :1782
## (Other) :6307
## Descript DayOfWeek Date
## GRAND THEFT FROM LOCKED AUTO: 3766 Friday :4451 06/28/2014: 410
## STOLEN AUTOMOBILE : 1350 Monday :4005 08/09/2014: 410
## LOST PROPERTY : 1202 Saturday :4319 08/08/2014: 403
## PETTY THEFT OF PROPERTY : 1125 Sunday :4218 06/29/2014: 397
## WARRANT ARREST : 980 Thursday :3968 08/29/2014: 388
## PETTY THEFT FROM LOCKED AUTO: 955 Tuesday :3930 06/04/2014: 380
## (Other) :19615 Wednesday:4102 (Other) :26605
## Time PdDistrict Resolution
## 12:00 : 784 SOUTHERN :5739 NONE :19139
## 00:01 : 661 MISSION :3700 ARREST, BOOKED : 6502
## 18:00 : 649 NORTHERN :3589 ARREST, CITED : 1419
## 19:00 : 621 CENTRAL :3513 LOCATED : 1042
## 17:00 : 594 BAYVIEW :2725 UNFOUNDED : 260
## 20:00 : 586 INGLESIDE:2378 JUVENILE BOOKED: 163
## (Other):25098 (Other) :7349 (Other) : 468
## Address X Y
## 800 Block of BRYANT ST : 948 Min. :-122.5 Min. :37.71
## 800 Block of MARKET ST : 288 1st Qu.:-122.4 1st Qu.:37.76
## 900 Block of POTRERO AV : 230 Median :-122.4 Median :37.78
## 1000 Block of POTRERO AV: 199 Mean :-122.4 Mean :37.77
## 2000 Block of MISSION ST: 149 3rd Qu.:-122.4 3rd Qu.:37.79
## 16TH ST / MISSION ST : 116 Max. :-122.4 Max. :37.82
## (Other) :27063
## Location PdId
## (37.775420706711, -122.403404791479) : 940 Min. :1.028e+12
## (37.7571580431915, -122.406604919508): 224 1st Qu.:1.405e+13
## (37.7564864109309, -122.406539115148): 196 Median :1.406e+13
## (37.7650501214668, -122.419671780296): 152 Mean :1.420e+13
## (37.7841893501425, -122.407633520742): 150 3rd Qu.:1.407e+13
## (37.7285280627465, -122.475647460786): 102 Max. :9.904e+13
## (Other) :27229
names(SF_CrimeData)
## [1] "IncidntNum" "Category" "Descript" "DayOfWeek" "Date"
## [6] "Time" "PdDistrict" "Resolution" "Address" "X"
## [11] "Y" "Location" "PdId"
head(SF_CrimeData)
## IncidntNum Category Descript DayOfWeek
## 1 140734311 ARSON ARSON OF A VEHICLE Sunday
## 2 140736317 NON-CRIMINAL LOST PROPERTY Sunday
## 3 146177923 LARCENY/THEFT GRAND THEFT FROM LOCKED AUTO Sunday
## 4 146177531 LARCENY/THEFT GRAND THEFT FROM LOCKED AUTO Sunday
## 5 140734220 NON-CRIMINAL FOUND PROPERTY Sunday
## 6 140734349 DRUG/NARCOTIC POSSESSION OF MARIJUANA Sunday
## Date Time PdDistrict Resolution Address
## 1 08/31/2014 23:50 BAYVIEW NONE LOOMIS ST / INDUSTRIAL ST
## 2 08/31/2014 23:45 MISSION NONE 400 Block of CASTRO ST
## 3 08/31/2014 23:30 SOUTHERN NONE 1000 Block of MISSION ST
## 4 08/31/2014 23:30 RICHMOND NONE FULTON ST / 26TH AV
## 5 08/31/2014 23:23 RICHMOND NONE 800 Block of LA PLAYA ST
## 6 08/31/2014 23:13 SOUTHERN ARREST, BOOKED 11TH ST / MINNA ST
## X Y Location PdId
## 1 -122.4056 37.73832 (37.7383221869053, -122.405646994567) 1.407343e+13
## 2 -122.4350 37.76177 (37.7617677182954, -122.435012093789) 1.407363e+13
## 3 -122.4098 37.78004 (37.7800356268394, -122.409795194505) 1.461779e+13
## 4 -122.4853 37.77252 (37.7725176473142, -122.485262988324) 1.461775e+13
## 5 -122.5099 37.77231 (37.7723131976814, -122.509895418239) 1.407342e+13
## 6 -122.4166 37.77391 (37.773907074489, -122.416578493475) 1.407343e+13
Clean and Structure Data
In this step, we will clean up the data and prepare it for visualization and analysis.
SF_CrimeData$Date <- as.Date(SF_CrimeData$Date, "%m/%d/%Y")
breaks <- c(0, 6, 12, 18, 20, 24) / 24
labels <- c("midnight","morning", "afternoon", "evening","night")
timeList <- times(paste0(SF_CrimeData$Time, ":00"))
SF_CrimeData$timeOfDay <- cut(timeList, breaks, labels, include.lowest = TRUE)
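Because `cut()` uses right-closed intervals by default (`right = TRUE`), a time that falls exactly on a break is assigned to the earlier bin. The sketch below illustrates this with the same breaks and labels, using plain numeric fractions of a day as a stand-in for the chron `times` values (an assumption made to keep the example free of extra packages; the toy times are also made up):

```r
# Toy values: 00:00, 06:00, 12:00, 12:01 and 23:50 as fractions of a day
frac <- c(0, 6 * 60, 12 * 60, 12 * 60 + 1, 23 * 60 + 50) / (24 * 60)

breaks <- c(0, 6, 12, 18, 20, 24) / 24
labels <- c("midnight", "morning", "afternoon", "evening", "night")

# Default behaviour: intervals are (lower, upper], so 12:00 falls in "morning"
right_closed <- cut(frac, breaks, labels, include.lowest = TRUE)

# Alternative: intervals are [lower, upper), so 12:00 falls in "afternoon"
left_closed <- cut(frac, breaks, labels, include.lowest = TRUE, right = FALSE)

print(data.frame(frac, right_closed, left_closed))
```

Since many incidents are recorded on the hour (12:00 is the most common value of `Time` in the summary above), the choice of `right` may noticeably shift counts between adjacent bins.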
#Top 10 Crimes by Category
SF_CrimeData$Category <- as.factor(SF_CrimeData$Category)
SF_CD_CategoryData <- as.data.frame(table(SF_CrimeData$Category), stringsAsFactors=FALSE)
colnames(SF_CD_CategoryData) <- c("Category", "Frequency")
SF_CD_CategoryData <- SF_CD_CategoryData[order(SF_CD_CategoryData$Frequency, decreasing= TRUE), ]
SF_CD_CategoryData_Top10 <- SF_CD_CategoryData[1:10, ]
top10Crimes <- SF_CD_CategoryData_Top10$Category
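The count, sort, and slice pattern used above can be tried on a small vector. This is a minimal sketch with made-up categories (the vector `category` and the name `top2` are illustrative assumptions); it uses `head()` instead of `[1:10, ]`, which avoids rows of `NA` when fewer categories exist than requested:

```r
# Hypothetical incident categories, for illustration only
category <- c("LARCENY/THEFT", "ASSAULT", "LARCENY/THEFT",
              "ARSON", "ASSAULT", "LARCENY/THEFT")

# Tabulate, convert to a data frame, and give the columns readable names
counts <- as.data.frame(table(category), stringsAsFactors = FALSE)
colnames(counts) <- c("Category", "Frequency")

# Sort by descending frequency and keep the top 2 rows
top2 <- head(counts[order(counts$Frequency, decreasing = TRUE), ], 2)
print(top2)
```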
#Top 10 Crimes by Category and Other Variables
SF_CD_CategoryData_Full <- as.data.frame(table(SF_CrimeData$Category, SF_CrimeData$timeOfDay, SF_CrimeData$DayOfWeek, SF_CrimeData$PdDistrict, SF_CrimeData$Resolution), stringsAsFactors=FALSE)
colnames(SF_CD_CategoryData_Full) <- c("Category", "TimeOfDay", "DayOfWeek", "PdDistrict", "Resolution", "Frequency")
SF_CD_CategoryData_Full_Top10 <- SF_CD_CategoryData_Full[SF_CD_CategoryData_Full$Category %in% top10Crimes,]
#Top 10 Crimes by Category and Time of Day
SF_CD_CategoryData_TOD <- as.data.frame(table(SF_CrimeData$Category, SF_CrimeData$timeOfDay), stringsAsFactors=FALSE)
colnames(SF_CD_CategoryData_TOD) <- c("Category", "TimeOfDay", "Frequency")
SF_CD_CategoryData_TOD_Top10 <- SF_CD_CategoryData_TOD[SF_CD_CategoryData_TOD$Category %in% top10Crimes,]
#Top 10 Crimes by Category and Day of Week
SF_CD_CategoryData_DOW <- as.data.frame(table(SF_CrimeData$Category, SF_CrimeData$DayOfWeek), stringsAsFactors=FALSE)
colnames(SF_CD_CategoryData_DOW) <- c("Category", "DayOfWeek", "Frequency")
SF_CD_CategoryData_DOW_Top10 <- SF_CD_CategoryData_DOW[SF_CD_CategoryData_DOW$Category %in% top10Crimes,]
#Assault by Day of Week and Time of Day
SF_AssaultData <- subset(SF_CrimeData, Category == "ASSAULT")
SF_AssaultData_TOD_DOW <- as.data.frame(table(SF_AssaultData$timeOfDay, SF_AssaultData$DayOfWeek ), stringsAsFactors=FALSE)
colnames(SF_AssaultData_TOD_DOW) <- c("TimeOfDay", "DayOfWeek", "Frequency")
#Assault Data by Location
SF_AssaultData$Longitude <- SF_AssaultData$X
SF_AssaultData$Latitude <- SF_AssaultData$Y
SF_AD_Location <- dcast(SF_AssaultData, Latitude + Longitude ~ .)
## Using Latitude as value column: use value.var to override.
## Aggregation function missing: defaulting to length
colnames(SF_AD_Location) <- c("Latitude", "Longitude", "Frequency")
#Assault Data by Location And Day of the Week
SF_AD_Location_DOW <- dcast(SF_AssaultData, Latitude + Longitude + DayOfWeek ~ .)
## Using Latitude as value column: use value.var to override.
## Aggregation function missing: defaulting to length
colnames(SF_AD_Location_DOW) <- c("Latitude", "Longitude", "DayOfWeek", "Frequency")
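The messages printed by `dcast` indicate that it fell back to counting rows (`length`) for each location. A base R sketch that makes the counting explicit is shown below; the toy coordinates and the names `toy` and `location_counts` are assumptions for illustration, and `aggregate()` is swapped in here in place of `dcast`:

```r
# Hypothetical assault records: two at the same location, one elsewhere
toy <- data.frame(
  Latitude  = c(37.77, 37.77, 37.78),
  Longitude = c(-122.40, -122.40, -122.41),
  DayOfWeek = c("Friday", "Monday", "Friday")
)

# Add a column of ones and sum it per (Latitude, Longitude) pair,
# which yields the number of incidents at each location
toy$Frequency <- 1L
location_counts <- aggregate(Frequency ~ Latitude + Longitude,
                             data = toy, FUN = sum)
print(location_counts)
```

Adding `DayOfWeek` to the right-hand side of the formula would give the per-day breakdown, analogous to `SF_AD_Location_DOW`.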
First, we look at the top 10 crime categories by frequency in San Francisco.
positions <- SF_CD_CategoryData_Top10$Category
ggplot(SF_CD_CategoryData_Top10) +
  geom_bar(aes(x = Category, y = Frequency, fill = Frequency), stat = "identity") +
  scale_x_discrete(limits = positions) +
  ggtitle("Top 10 Categories of Crimes for San Francisco (Summer 2014)") +
  xlab("Category of Crime") +
  ylab("Frequency of Crime") +
  theme(plot.title = element_text(size = rel(1.2))) +
  theme(axis.text.x = element_text(angle = 75, hjust = 1))
Next we take a look at Top 10 Crimes by Frequency by Time of Day
timePositions <- c("morning", "afternoon", "evening","night","midnight")
positions <- SF_CD_CategoryData_Top10$Category
ggplot(SF_CD_CategoryData_TOD_Top10, aes(Category, TimeOfDay)) +
  geom_tile(aes(fill = Frequency)) +
  scale_x_discrete(limits = positions) +
  scale_y_discrete(limits = timePositions) +
  ggtitle("Top 10 Crimes by Time of Day in San Francisco (Summer 2014)") +
  xlab("Category of Crime") +
  ylab("Time of Day") +
  theme(plot.title = element_text(size = rel(1.2))) +
  theme(axis.text.x = element_text(angle = 75, hjust = 1))
And Top 10 Crimes by Frequency by Day of Week
dayPositions <- c("Monday", "Tuesday", "Wednesday","Thursday","Friday", "Saturday", "Sunday")
positions <- SF_CD_CategoryData_Top10$Category
ggplot(SF_CD_CategoryData_DOW_Top10, aes(Category, DayOfWeek)) +
  geom_tile(aes(fill = Frequency)) +
  scale_x_discrete(limits = positions) +
  scale_y_discrete(limits = dayPositions) +
  ggtitle("Top 10 Crimes by Day of Week in San Francisco (Summer 2014)") +
  xlab("Category of Crime") +
  ylab("Day of Week") +
  theme(plot.title = element_text(size = rel(1.2))) +
  theme(axis.text.x = element_text(angle = 75, hjust = 1))
Interestingly, across the top 10 crime categories, reported incidents are most frequent in the afternoon and least frequent after midnight and into the morning. Assault, however, appears to be more frequent during the evening and midnight periods relative to other crimes. Plausible reasons include that this is when people are at home and that people tend to drink more alcohol at night.
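Comparing one category "relative to other crimes" is easier when counts are put on a common footing, because a large category such as LARCENY/THEFT dominates raw frequencies and because the time-of-day bins defined earlier have unequal widths (6, 6, 6, 2, and 4 hours). The following is a minimal sketch of both adjustments; the counts in the matrix are invented for illustration and are not taken from the San Francisco data:

```r
# Made-up counts (NOT from the dataset): rows are categories, columns are bins
counts <- matrix(c(30, 90, 120, 40, 60,
                   25, 30,  35, 15, 30),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(
                   Category  = c("LARCENY/THEFT", "ASSAULT"),
                   TimeOfDay = c("midnight", "morning", "afternoon",
                                 "evening", "night")))

# Share of each category's incidents falling in each bin (rows sum to 1)
share <- prop.table(counts, margin = 1)

# Incidents per hour, dividing each column by the width of its bin
hours <- c(midnight = 6, morning = 6, afternoon = 6, evening = 2, night = 4)
per_hour <- sweep(counts, 2, hours, "/")

print(round(share, 2))
print(per_hour)
```

Applied to the real table of `Category` by `timeOfDay`, the same two steps would show whether the apparent evening and midnight pattern for assault holds once category size and bin width are accounted for.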
Taking a further look at reported Assault
First, we look at a heatmap of reported assaults by day of week and time of day. There is a clear pattern of higher frequency during the afternoon and, interestingly, higher frequency on Monday and Friday than on other days of the week. Unsurprisingly, assaults are also more frequent after midnight on Saturday (Friday night) and Sunday (Saturday night).
dayPositions <- c("Monday", "Tuesday", "Wednesday","Thursday","Friday", "Saturday", "Sunday")
timePositions <- c("morning", "afternoon", "evening","night","midnight")
# DayOfWeek is mapped to x and TimeOfDay to y, so the axis labels must match
ggplot(SF_AssaultData_TOD_DOW, aes(DayOfWeek, TimeOfDay)) +
  geom_tile(aes(fill = Frequency)) +
  scale_x_discrete(limits = dayPositions) +
  scale_y_discrete(limits = timePositions) +
  ggtitle("Reported Assault by Time of Day and Day of Week in San Francisco (Summer 2014)") +
  xlab("Day of Week") +
  ylab("Time of Day") +
  theme(plot.title = element_text(size = rel(1.2))) +
  theme(axis.text.x = element_text(angle = 75, hjust = 1))
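The plots above set the axis order with `scale_x_discrete(limits = ...)` because day names otherwise sort alphabetically (as in the `summary()` output, where Friday comes first). One alternative, sketched here in base R on a made-up vector of days, is to store the order in the factor levels so that tables follow calendar order without per-plot settings; ggplot2 generally follows factor level order on discrete axes as well:

```r
dayPositions <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                  "Friday", "Saturday", "Sunday")

# Hypothetical day values, for illustration only
days <- c("Sunday", "Friday", "Monday", "Friday")

# Explicit levels fix the order; days that do not occur get a count of 0
ordered_days <- factor(days, levels = dayPositions)
day_counts <- table(ordered_days)
print(day_counts)
```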
Second, we look at the location of reported assaults by frequency. Interestingly, the number of reported assaults is quite high around Chinatown, Union Square, and the Financial District, areas that tourists frequent. For a city that wants to encourage tourism, this is not a good trend.
qmplot(Longitude, Latitude, data = SF_AD_Location, size = Frequency, maptype = "toner-lite") +
  scale_colour_brewer(type = "div", palette = "Accent") +
  ggtitle(expression(atop(bold("Reported Assault in San Francisco by Location of Incidents (Summer 2014)"), ""))) +
  theme(plot.title = element_text(size = rel(.9)))
## Using zoom = 13...
Last, we look again at reported assaults by location, this time also broken down by day of the week.
# "Accent" is a qualitative ColorBrewer palette, so type is set to "qual"
qmplot(Longitude, Latitude, data = SF_AD_Location_DOW, color = DayOfWeek, size = Frequency, maptype = "toner-lite") +
  scale_colour_brewer(type = "qual", palette = "Accent") +
  ggtitle(expression(atop(bold("Reported Assault in San Francisco by Location and Day of Week of Incidents (Summer 2014)"), ""))) +
  theme(plot.title = element_text(size = rel(.8)))
## Using zoom = 13...
The San Francisco crime data provides rich detail that can give significant insight into trends in reported crimes. Differences in the frequency of reported crimes by day of the week and time of day can inform police staffing requirements. Differences in frequency by crime type and location can indicate where police might want to increase attention and patrols. The frequency of reported assaults by location also highlights a trend that should be concerning: the level of reported assaults in locations frequented by tourists.