As the helpfully descriptive title suggests, this report details an analysis of crime data in San Francisco during the summer months of 2014. Focusing on one of the more frequent crime types, larceny and theft, we use R and ggplot2 to slice, dice and plot the data to figure out the time of day and week that such crime is most likely to occur.
In this section, we download and prepare the data for analysis.
The dataset consists of incidents recorded in the summer of 2014, in various areas of San Francisco, California.
Import necessary libraries:
library(RCurl) # used for downloading dataset
library(ggplot2) # used for plotting
library(scales) # used for scaling plot axes
Download and read in data, check the dimensions, and preview a few rows:
dat_file <- getURL("https://raw.githubusercontent.com/uwescience/datasci_course_materials/master/assignment6/sanfrancisco_incidents_summer_2014.csv")
dat <- read.csv(text=dat_file)
dim(dat)
## [1] 28993 13
head(dat,4)
##   IncidntNum      Category                     Descript DayOfWeek
## 1  140734311         ARSON           ARSON OF A VEHICLE    Sunday
## 2  140736317  NON-CRIMINAL                LOST PROPERTY    Sunday
## 3  146177923 LARCENY/THEFT GRAND THEFT FROM LOCKED AUTO    Sunday
## 4  146177531 LARCENY/THEFT GRAND THEFT FROM LOCKED AUTO    Sunday
##         Date  Time PdDistrict Resolution                   Address
## 1 08/31/2014 23:50    BAYVIEW       NONE LOOMIS ST / INDUSTRIAL ST
## 2 08/31/2014 23:45    MISSION       NONE    400 Block of CASTRO ST
## 3 08/31/2014 23:30   SOUTHERN       NONE  1000 Block of MISSION ST
## 4 08/31/2014 23:30   RICHMOND       NONE       FULTON ST / 26TH AV
##           X        Y                              Location         PdId
## 1 -122.4056 37.73832 (37.7383221869053, -122.405646994567) 1.407343e+13
## 2 -122.4350 37.76177 (37.7617677182954, -122.435012093789) 1.407363e+13
## 3 -122.4098 37.78004 (37.7800356268394, -122.409795194505) 1.461779e+13
## 4 -122.4853 37.77252 (37.7725176473142, -122.485262988324) 1.461775e+13
Many questions can be explored in this rich dataset, considering that time, location, type, and resolution of each incident are reported. In this report, we will be aggregating data by time, and not focusing on geographical details. Therefore, we remove columns not used:
keep <- which(names(dat) %in% c("Category","Descript","DayOfWeek","Date",
"Time","Resolution"))
dat <- dat[, keep]
Check for missing values:
any(is.na(dat))
## [1] FALSE
No missing values. The dataset was already pretty clean to begin with, so not many steps were necessary here. Now we can begin the analysis.
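One caveat on the check above: is.na() only flags true NA values, so empty strings in text columns would pass unnoticed. The following is a minimal sketch of a stricter check. The helper name count_blank and the toy data frame are illustrative assumptions, not part of the original analysis.

```r
# Hypothetical helper: count NA or empty-string entries in each column.
count_blank <- function(df) {
  sapply(df, function(col) {
    sum(is.na(col) | trimws(as.character(col)) == "")
  })
}

# Toy data standing in for `dat` (made up for illustration only)
toy <- data.frame(Category = c("ARSON", "", "ASSAULT"),
                  Time     = c("23:50", NA, "23:30"))

blank <- count_blank(toy)
print(blank)
```

Applied to the real data, count_blank(dat) would return one count per column, and all zeros would confirm that nothing is blank as well as nothing being NA.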
Let’s find the most frequent crimes, regardless of time or date of occurrence.
ggplot(dat,aes(x=reorder(Category,Category,function(x) length(x)))) +
geom_bar(color="white") +
coord_flip() +
labs(x="Category", y="Count", title="Count of all Crimes by Category") +
theme_bw() +
theme(text=element_text(size=9))
Sorting by frequency, we see that larceny/theft leads the list. Larceny/theft covers the taking of property, but several related categories also appear in this time period, each slightly different: robbery involves the use of force against a person, vehicle theft involves vehicles, and burglary involves unlawful breaking and entering. Possession of stolen property and trespass are other related categories in the data.
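The ranking shown in the bar chart can also be read off as numbers. This is a minimal sketch, assuming a data frame with a Category column like dat. The helper name top_categories and the toy data are illustrative assumptions.

```r
# Hypothetical helper: incident counts per category, most frequent first.
top_categories <- function(df, n = 5) {
  counts <- sort(table(df$Category), decreasing = TRUE)
  counts[seq_len(min(n, length(counts)))]
}

# Toy data standing in for `dat` (made up for illustration only)
toy <- data.frame(Category = c("LARCENY/THEFT", "ASSAULT", "LARCENY/THEFT",
                               "ARSON", "LARCENY/THEFT", "ASSAULT"))

top <- top_categories(toy, n = 2)
print(top)
```

On the real data, top_categories(dat) would list the five most frequent categories with their counts.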
Let’s focus on the top category, larceny/theft. We create a separate data set with only this category:
lt_dat <- dat[dat$Category=="LARCENY/THEFT",]
# select columns by name from lt_dat itself, the data frame being subset
lt_keep <- which(names(lt_dat) %in% c("Descript","DayOfWeek","Date",
"Time","Resolution"))
lt_dat <- lt_dat[, lt_keep]
Plotting incidents over time, grouped by date, we notice a slight upward trend in the data. There are several interesting spikes in August.
# convert Date to Date format and add it as a new variable
lt_dat$newDate <- as.Date(lt_dat$Date,format="%m/%d/%Y")
ggplot(lt_dat, aes(x=newDate)) +
geom_freqpoly(binwidth=1,size=0.65,color="black") +
labs(x="Date", y="Count", title="Theft/Larceny by Date") +
theme_bw() +
theme(text=element_text(size=9))
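The "slight upward trend" is a visual impression. One rough way to put a number on it is to count incidents per day and fit a least-squares line. The sketch below assumes dates in the same "mm/dd/YYYY" format as the Date column. The helper name daily_counts and the toy dates are illustrative assumptions, and a straight line is only one of several reasonable ways to summarise a trend.

```r
# Hypothetical helper: incidents per calendar day from "mm/dd/YYYY" strings.
# Days with no incidents do not appear in the result.
daily_counts <- function(dates) {
  d <- as.Date(dates, format = "%m/%d/%Y")
  counts <- table(d)
  data.frame(date = as.Date(names(counts)), n = as.vector(counts))
}

# Toy dates standing in for lt_dat$Date (made up for illustration only)
toy_dates <- c("06/01/2014", "06/01/2014",
               "06/02/2014", "06/02/2014", "06/02/2014",
               "06/03/2014", "06/03/2014", "06/03/2014", "06/03/2014")
daily <- daily_counts(toy_dates)

# Slope of the fitted line: average change in incident count per day
fit <- lm(n ~ as.numeric(date), data = daily)
slope <- unname(coef(fit)[2])
print(daily)
print(slope)
```

A positive slope on the real data would support the upward trend, though the August spikes could pull the fit noticeably.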
Let’s change scale and look at incidents grouped by day of week. We also color the plot bars by whether or not the crime was resolved. In the dataset, this corresponds to whether the Resolution column is NONE (unresolved) or any other entry (resolved).
# reorder DayOfWeek
target <- c("Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday")
lt_dat$DayOfWeek <- factor(lt_dat$DayOfWeek,levels=target)
# add Unresolved variable
lt_dat$Unresolved <- lt_dat$Resolution=="NONE"
ggplot(lt_dat,aes(x=DayOfWeek,fill=Unresolved)) +
geom_bar(color="white") +
labs(x="Day of Week", y="Count", title="Theft/Larceny by Day of Week") +
theme_bw() +
theme(text=element_text(size=9))
We can see that counts rise by a few hundred towards the end of the week (Friday, Saturday, Sunday). Several things could explain this rise: crime levels may actually have been higher, or police presence or the number of reports may have increased on those days. It also appears that the fraction of unresolved incidents is lower on those days of the week.
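The unresolved fraction is hard to judge by eye from stacked bars, so it can help to compute it directly. This is a minimal sketch, assuming the same convention as above, where a Resolution of "NONE" means unresolved. The helper name unresolved_share and the toy data are illustrative assumptions.

```r
# Hypothetical helper: fraction of incidents with Resolution == "NONE"
# for each day of the week (NA for days with no incidents).
unresolved_share <- function(day, resolution) {
  days <- c("Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday")
  day <- factor(day, levels = days)
  tapply(resolution == "NONE", day, mean)
}

# Toy data standing in for lt_dat (made up for illustration only)
toy <- data.frame(
  DayOfWeek  = c("Monday", "Monday",
                 "Saturday", "Saturday", "Saturday", "Saturday"),
  Resolution = c("NONE", "NONE",
                 "NONE", "ARREST, BOOKED", "NONE", "ARREST, BOOKED")
)

share <- unresolved_share(toy$DayOfWeek, toy$Resolution)
print(round(share, 2))
```

On the real data, unresolved_share(lt_dat$DayOfWeek, lt_dat$Resolution) would give one fraction per weekday, which makes small differences between days easier to compare.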
Now let’s plot the data by hour of the day.
# Convert Time to POSIXct and add it as a new variable.
# Parse in UTC so the hours match date_format(), which formats in UTC by default.
lt_dat$newTime <- as.POSIXct(lt_dat$Time,format="%H:%M",tz="UTC")
ggplot(lt_dat, aes(x=newTime,fill=Unresolved)) +
geom_histogram(binwidth = 60*60, color="white") +
labs(x="Time of Day (hour)", y="Count", title="Theft/Larceny by Hour of Day") +
scale_x_datetime(labels=date_format("%H")) +
theme_bw() +
theme(text=element_text(size=9))
The plot shows that larceny/theft becomes more frequent during the later hours, with a peak at 2am. The rise and fall in frequency also roughly coincide with sunset and sunrise. If daylight is a factor, incidents in winter months might start to rise earlier in the day, since the sun sets earlier. It’s also interesting to see that the fraction of unresolved crimes decreases during the evening and night, perhaps because of an increased police presence at night.
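The hourly pattern can be cross-checked without any date-time parsing. as.POSIXct() without a tz argument parses in the session's time zone, whereas scales::date_format() formats in UTC by default, so hour labels on a datetime axis can be shifted when the two differ. Reading the hour straight from the "HH:MM" strings sidesteps this. The helper name hour_of_day and the toy times below are illustrative assumptions.

```r
# Hypothetical helper: integer hour (0-23) taken from "HH:MM" strings.
hour_of_day <- function(time) {
  as.integer(substr(as.character(time), 1, 2))
}

# Toy times standing in for lt_dat$Time (made up for illustration only)
toy_time <- c("23:50", "23:45", "18:30", "02:05", "18:10", "18:59")
hours <- hour_of_day(toy_time)

# Count per hour, keeping empty hours so the table always has 24 entries
by_hour <- table(factor(hours, levels = 0:23))

# Hour with the most incidents (the first one if there is a tie)
peak_hour <- as.integer(names(by_hour)[which.max(by_hour)])
print(peak_hour)
```

On the real data, the same table built from hour_of_day(lt_dat$Time) can be compared against the histogram to confirm which hour the peak falls in.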
Based on the above analysis, we conclude that larceny and theft generally occur more frequently at night and on weekends. We also find that most crimes of this type go unresolved.
We must qualify these findings with the fact that this dataset is aggregated over several districts in one city, during a few months of one year. Different patterns might therefore emerge by, for example, looking at individual regions of San Francisco.
Additionally, from the dataset, it is not clear whether the incidents increase in frequency because there is actually more crime, or because crime is reported more often during certain hours.
Many other questions can be explored using this data. For example, there may be incidents where two or more crimes were committed at the same time. It’s also likely that statistics differ by region (PdDistrict). It’s even possible that specific weekends and holidays (for example, the 4th of July) have increased crime levels.