This document attempts to use publicly available dataset from DataSF, which includes police incident reports filed by officers and by individuals through self-service online reporting for non-emergency cases in San Francisco from 2018 to present, to uncover any insights using Exploratory Data Analysis (EDA).
Take a look at the data types.
## Rows: 336,517
## Columns: 36
## $ `Incident Datetime` <chr> "2019/05/01 01…
## $ `Incident Date` <date> 2019-05-01, 2…
## $ `Incident Time` <time> 01:00:00, 07:…
## $ `Incident Year` <dbl> 2019, 2019, 20…
## $ `Incident Day of Week` <chr> "Wednesday", "…
## $ `Report Datetime` <chr> "2019/06/12 08…
## $ `Row ID` <dbl> 81097515200, 8…
## $ `Incident ID` <dbl> 810975, 814655…
## $ `Incident Number` <chr> "190424067", "…
## $ `CAD Number` <dbl> 191634131, 191…
## $ `Report Type Code` <chr> "II", "II", "I…
## $ `Report Type Description` <chr> "Initial", "In…
## $ `Filed Online` <lgl> NA, NA, NA, NA…
## $ `Incident Code` <chr> "15200", "6402…
## $ `Incident Category` <chr> "Offences Agai…
## $ `Incident Subcategory` <chr> "Other", "Othe…
## $ `Incident Description` <chr> "Domestic Viol…
## $ Resolution <chr> "Open or Activ…
## $ Intersection <chr> "40TH AVE \\ I…
## $ CNN <dbl> 27789000, 2428…
## $ `Police District` <chr> "Taraval", "So…
## $ `Analysis Neighborhood` <chr> "Sunset/Parksi…
## $ `Supervisor District` <dbl> 4, 6, 10, 3, 2…
## $ Latitude <dbl> 37.76257, 37.7…
## $ Longitude <dbl> -122.4996, -12…
## $ point <chr> "(37.762569397…
## $ `SF Find Neighborhoods` <dbl> 39, 32, 88, 10…
## $ `Current Police Districts` <dbl> 10, 1, 2, 6, 4…
## $ `Current Supervisor Districts` <dbl> 7, 10, 9, 3, 6…
## $ `Analysis Neighborhoods` <dbl> 35, 34, 1, 6, …
## $ `HSOC Zones as of 2018-06-05` <dbl> NA, 1, NA, NA,…
## $ `OWED Public Spaces` <dbl> NA, NA, NA, 18…
## $ `Central Market/Tenderloin Boundary Polygon - Updated` <dbl> NA, 1, NA, NA,…
## $ `Parks Alliance CPSI (27+TL sites)` <dbl> NA, NA, NA, NA…
## $ `ESNCAG - Boundary File` <dbl> NA, NA, NA, NA…
## $ `Areas of Vulnerability, 2016` <dbl> 1, 2, 2, 2, 1,…
Some data preparation
char_cols <- c("Row ID", "Incident ID")
data_1 <- data %>% filter(`Incident Year` %in% c(2018, 2019),
`Police District`!="Out of SF",
!(`Incident Category` %in% c('NA', 'Case Closure'))) %>%
mutate_at(char_cols, as.character)Given that there might be multiple Incident IDs in the dataset that represents the one to many relationship when they exist, we would not want to count the duplicate incident reports. It is also noted that for Incident IDs with duplicates, the report datetime are the same for all the duplicated rows. As such, we will remove duplicates by ‘Incident ID’.
To explore the dataset further, we can look at the number of incidents by “Incident Year” and “Incident Category”.
data_cat <- data_2 %>% group_by(`Incident Year`,`Incident Category`) %>%
dplyr::summarise(Incidents = dplyr::n()) %>%
filter(!is.na(`Incident Category`))Check the levels of “Incident Category”.
data_cat$`Incident Category` <- as.factor(data_cat$`Incident Category`)
levels(data_cat$`Incident Category`)## [1] "Arson"
## [2] "Assault"
## [3] "Burglary"
## [4] "Civil Sidewalks"
## [5] "Courtesy Report"
## [6] "Disorderly Conduct"
## [7] "Drug Offense"
## [8] "Drug Violation"
## [9] "Embezzlement"
## [10] "Family Offense"
## [11] "Fire Report"
## [12] "Forgery And Counterfeiting"
## [13] "Fraud"
## [14] "Gambling"
## [15] "Homicide"
## [16] "Human Trafficking (A), Commercial Sex Acts"
## [17] "Human Trafficking, Commercial Sex Acts"
## [18] "Larceny Theft"
## [19] "Liquor Laws"
## [20] "Lost Property"
## [21] "Malicious Mischief"
## [22] "Miscellaneous Investigation"
## [23] "Missing Person"
## [24] "Motor Vehicle Theft"
## [25] "Motor Vehicle Theft?"
## [26] "Non-Criminal"
## [27] "Offences Against The Family And Children"
## [28] "Other"
## [29] "Other Miscellaneous"
## [30] "Other Offenses"
## [31] "Prostitution"
## [32] "Rape"
## [33] "Recovered Vehicle"
## [34] "Robbery"
## [35] "Sex Offense"
## [36] "Stolen Property"
## [37] "Suicide"
## [38] "Suspicious"
## [39] "Suspicious Occ"
## [40] "Traffic Collision"
## [41] "Traffic Violation Arrest"
## [42] "Vandalism"
## [43] "Vehicle Impounded"
## [44] "Vehicle Misplaced"
## [45] "Warrant"
## [46] "Weapons Carrying Etc"
## [47] "Weapons Offence"
## [48] "Weapons Offense"
Looking at the levels of “Incident Category”, there are some categories that can be grouped together.
For example:
After we have group and sum these categories, we can plot a basic horizontal bar chart to see the count of Incidents for each “Incident Category” as seen below.
Here is a pie chart showing the proportion of Incident Report cases that were filed online. Most incident cases were not filed online. Also, there are negligible changes in proportion between 2018 and 2019.
library(ggpubr)
fig <- ggarrange(p1, p2, nrow=1, common.legend = TRUE, legend="right")
annotate_figure(
fig, top=text_grob("Proportion of San Francisco Police Department Incident Reports Filed online\n",
color = "black", face = "bold", size = 12)
)Calculate the number of incidents for each “Incident Date”.
Plot the Calendar Heatmap.
ggplotly(ggplot(data_calendar, aes(Week_of_month, Day_of_week, fill = Incidents)) +
geom_tile(colour = "white") +
facet_grid(year(data_calendar$Group)~monthf) +
scale_fill_gradient(low="#ffffff", high="#0052cc") +
labs(fill = "Incidents", x="", y="") +
ggtitle("Calendar Heatmap of San Francisco Police Department Incident Reports 2018 - 2019") +
theme_bw(base_size=10)+
theme(plot.title = element_text(family = 'Arial',
size=12, face="bold", hjust=0.5),
axis.text = element_text(family = 'Arial'),
legend.text = element_text(family = 'Arial'),
legend.title=element_text(family = 'Arial'),
strip.text = element_text(family = 'Arial', size = 10),
panel.border=element_blank(),
axis.ticks=element_blank(),
strip.background=element_blank(),
legend.position="top",
legend.justification="right",
legend.direction="horizontal",
legend.key.size=unit(0.3,"cm"),
legend.spacing.x=unit(0.1,"cm"))) %>%
layout(annotations = c(list(list(text = "Week of Month",
x = 0.5,
y = -0.1,
xref = "paper",
yref = "paper",
showarrow = F,
font=list(family='Arial', size=13))),
list(list(text = "Day of Week",
x = -0.1,
y = 0.51,
xref = "paper",
yref = "paper",
textangle = -90,
showarrow = F,
font=list(family='Arial', size=13)))))From the Calendar Heatmap, we see that there is generally a higher number of incidents in July and August for both years compared to other months.
We can also narrow down to a particular Incident Category, for instance “Larceny Theft”.
data_calendar_larceny <- data_2 %>% group_by(`Incident Date`, `Incident Category`) %>%
filter(`Incident Category`=='Larceny Theft') %>%
dplyr::summarise(Incidents = dplyr::n()) Download the geometry information of San Fransico Neighbourhoods from DataSF.