I chose to have the focus of my analysis be police incident reports from the City of San Francisco Police Department. My intention was to include analysis of crime clearance statistics, and unfortunately that is not possible and I will get into that dynamic a little during my presentation. Ultimately, I chose to conduct an analysis over the years of crime and reports of crime.
There is no readily available data for before 2003, so I cannot make quantitative statements from my own analysis of crime trends from before 2003 (I would love to based on reporting I have seen; however that would not be my own work).
There are two different data sets; one from 2003-2018 and another from 2018-current. From this change in data sources has come a difference in values for the same variables (this makes certain forms of analysis difficult to produce).
library(httr)
library(jsonlite)
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(readr)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ lubridate 1.9.2 ✔ tibble 3.2.1
## ✔ purrr 1.0.1 ✔ tidyr 1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ purrr::flatten() masks jsonlite::flatten()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
old_police_files <- list.files(pattern = '^Police_Data_....\\.csv')
old_police_tables <- lapply(old_police_files, read.csv, header = TRUE)
old_police_data <- do.call(rbind , old_police_tables)
present_police_files <- list.files(pattern = 'Present_Police_Data_....\\.csv')
present_police_tables <- lapply(present_police_files, read.csv, header = TRUE)
present_police_data <- do.call(rbind , present_police_tables)
sf_pop <- read.csv('SF_POP.csv')
sf_pop$Population <- 1000 * sf_pop$Population
sf_pop <- sf_pop %>%
separate(Year, c('year', 'month', 'date'), '-') %>%
select(year, Population)
sf_pop$year <- as.integer(sf_pop$year)
sf_pop <- sf_pop %>%
filter(year >= 2003)
old_police_data <- old_police_data %>%
separate(Date, c("month", "day","year"), "/")
old_police_data$year <- as.integer(old_police_data$year)
old_counts_and_incidents <- old_police_data %>%
group_by(year) %>%
summarize(count_reports = as.double(n()), count_incidents = as.double(length(unique(IncidntNum))))
new_counts_and_incidents <- present_police_data %>%
rename('year' = 'Incident.Year') %>%
group_by(year) %>%
summarize(count_reports = as.double(n()), count_incidents = as.double(length(unique(Incident.Number))))
counts_and_incidents <- rbind(new_counts_and_incidents, old_counts_and_incidents)
counts_and_incidents <- counts_and_incidents %>%
group_by(year) %>%
mutate(ratio = count_reports/count_incidents) %>%
arrange(desc(year))
knitr::kable(counts_and_incidents)
| year | count_reports | count_incidents | ratio |
|---|---|---|---|
| 2023 | 44748 | 32720 | 1.367604 |
| 2022 | 134843 | 98483 | 1.369201 |
| 2021 | 128775 | 93640 | 1.375214 |
| 2020 | 118313 | 84351 | 1.402627 |
| 2019 | 148272 | 108330 | 1.368707 |
| 2018 | 152707 | 111528 | 1.369226 |
| 2017 | 149487 | 119633 | 1.249546 |
| 2016 | 145994 | 115390 | 1.265222 |
| 2015 | 151459 | 120623 | 1.255640 |
| 2014 | 144844 | 114664 | 1.263204 |
| 2013 | 147664 | 113518 | 1.300798 |
| 2012 | 135464 | 106708 | 1.269483 |
| 2011 | 126713 | 98883 | 1.281444 |
| 2010 | 127758 | 99311 | 1.286444 |
| 2009 | 134309 | 105833 | 1.269065 |
| 2008 | 135242 | 111653 | 1.211271 |
| 2007 | 131771 | 109297 | 1.205623 |
| 2006 | 131856 | 112825 | 1.168677 |
| 2005 | 137048 | 112695 | 1.216096 |
| 2004 | 142054 | 113110 | 1.255893 |
| 2003 | 142803 | 114538 | 1.246774 |
# I did not include 2023 just yet, because the year is only a about 1/3 of the
# way over and would not contribute to reliable comparisions for raw counts
counts_and_incidents_no_2023 <- subset(counts_and_incidents, year != 2023)
ggplot(counts_and_incidents_no_2023, aes(x= year, y= count_incidents)) +
geom_point() +
geom_line() +
labs(x = 'Year', y = 'Incidents per Year', title = 'Incidents per Year Since 2003')
ggplot(counts_and_incidents, aes(x= year, y= ratio)) +
geom_point() +
geom_line() +
labs(x = 'Year', y = 'Ratio of Reports Filed v.s. Actual Incidents', title = 'The Ratio of Reported Crimes to Actual Incidents Since 2003')
years_and_counts <- full_join(sf_pop, counts_and_incidents, by = 'year')
years_and_counts <- years_and_counts %>%
mutate(reports_per_100k = 100000 * count_reports/Population, incidents_per_100k = 100000 * count_incidents/Population) %>%
arrange(desc(year))
knitr::kable(years_and_counts)
| year | Population | count_reports | count_incidents | ratio | reports_per_100k | incidents_per_100k |
|---|---|---|---|---|---|---|
| 2023 | NA | 44748 | 32720 | 1.367604 | NA | NA |
| 2022 | 808437 | 134843 | 98483 | 1.369201 | 16679.47 | 12181.90 |
| 2021 | 811253 | 128775 | 93640 | 1.375214 | 15873.59 | 11542.64 |
| 2020 | 870393 | 118313 | 84351 | 1.402627 | 13593.06 | 9691.14 |
| 2019 | 878826 | 148272 | 108330 | 1.368707 | 16871.60 | 12326.67 |
| 2018 | 879676 | 152707 | 111528 | 1.369226 | 17359.46 | 12678.30 |
| 2017 | 877471 | 149487 | 119633 | 1.249546 | 17036.12 | 13633.84 |
| 2016 | 871343 | 145994 | 115390 | 1.265222 | 16755.06 | 13242.78 |
| 2015 | 863237 | 151459 | 120623 | 1.255640 | 17545.47 | 13973.34 |
| 2014 | 850918 | 144844 | 114664 | 1.263204 | 17022.09 | 13475.33 |
| 2013 | 839695 | 147664 | 113518 | 1.300798 | 17585.43 | 13518.96 |
| 2012 | 828963 | 135464 | 106708 | 1.269483 | 16341.38 | 12872.47 |
| 2011 | 815694 | 126713 | 98883 | 1.281444 | 15534.38 | 12122.56 |
| 2010 | 805519 | 127758 | 99311 | 1.286444 | 15860.33 | 12328.82 |
| 2009 | 815184 | 134309 | 105833 | 1.269065 | 16475.91 | 12982.71 |
| 2008 | 807904 | 135242 | 111653 | 1.211271 | 16739.86 | 13820.08 |
| 2007 | 799185 | 131771 | 109297 | 1.205623 | 16488.17 | 13676.06 |
| 2006 | 786187 | 131856 | 112825 | 1.168677 | 16771.58 | 14350.91 |
| 2005 | 777835 | 137048 | 112695 | 1.216096 | 17619.16 | 14488.29 |
| 2004 | 773556 | 142054 | 113110 | 1.255893 | 18363.76 | 14622.08 |
| 2003 | 775663 | 142803 | 114538 | 1.246774 | 18410.44 | 14766.46 |
ggplot(years_and_counts, aes(x= year, y= incidents_per_100k)) +
geom_point() +
geom_line() +
labs(x = 'Year', y = 'Incidents per Year per 100k Residents', title = 'Incidents per Year per 100k Residents Since 2003')
## Warning: Removed 1 rows containing missing values (`geom_point()`).
## Warning: Removed 1 row containing missing values (`geom_line()`).
This is not possible for over time comparison and that has to do with how SF records “Resolutions” to incidents. There is no easily accessible (I could not find any after scouring SFPD’s website and SF City and County Documentation) documentation on what each resolution type meant for the older data, and then the classifications were changed with no explanation. Due to this lack of transparency, it is difficult to track crime clearance rates over time. It does seem that this is data that does exist internally and it is not open sourced.
new_outcomes <- present_police_data %>%
group_by(Resolution) %>%
summarize(counts = n())
unique(new_outcomes$Resolution)
## [1] "Cite or Arrest Adult" "Exceptional Adult" "Open or Active"
## [4] "Unfounded"
old_outcomes <- old_police_data %>%
group_by(Resolution) %>%
summarize(count = n())
unique(tolower(old_outcomes$Resolution))
## [1] "arrest, booked"
## [2] "arrest, cited"
## [3] "complainant refuses to prosecute"
## [4] "district attorney refuses to prosecute"
## [5] "exceptional clearance"
## [6] "located"
## [7] "none"
## [8] "not prosecuted"
## [9] "prosecuted by outside agency"
## [10] "prosecuted for lesser offense"
## [11] "psychopathic case"
## [12] "unfounded"
I am going to have to do some manual work to make a full cross walk because if just use the DataSF crosswalk that leaves me with about 6800 obersvations without a crime category. Not to mention, the missing crime categories occur more often in the earlier years which would slightly skew results in the direction of the dangerous SF narrative.
# This incident codes crosswalk comes from DataSF and using it leads to missing
# crime categories. Essentially, this crosswalk is not sufficient and needs to be
# improved and the process will unfortunately be largely manual.
#incident_codes <- jsonlite::fromJSON('https://data.sfgov.org/resource/ci9u-8awy.json?$limit=1012')
#incident_codes$inc_code <- as.integer(incident_codes$inc_code)
#incident_codes <- incident_codes %>%
# rename(Incident.Code = inc_code, Category = category, Subcategory = subcategory)
incident_codes <- present_police_data %>%
select(Incident.Code, Incident.Category) %>%
unique()
# I could continue this process, but this gets the number of observations missing
# a category down to a little less the 800, which is significantly better than the
# 6793 observations missing a category when you use DataSF's proprietary crosswalk
# and truthfully this is a bit of tedious process.
row1 = c(Incident.Code = 11010, Incident.Category = 'Stolen Property')
row2 = c(Incident.Code = 28090, Incident.Category = 'Vandalism')
row3 = c(Incident.Code = 30090, Incident.Category = 'Other Miscellaneous')
row4 = c(Incident.Code = 0, Incident.Category = 'Other Miscellaneous')
row5 = c(Incident.Code = 6311, Incident.Category = 'Larceny Theft')
row6 = c(Incident.Code = 16644, Incident.Category = 'Drug Violation')
incident_codes <- rbind(incident_codes, row1, row2, row3, row4, row5, row6)
incident_codes$Incident.Code <- as.integer(incident_codes$Incident.Code)
# Combined Drug Offense and Drug Violation into one category
incident_codes$Incident.Category[incident_codes$Incident.Category == 'Drug Offense'] <- 'Drug Violation'
# lumped the blank category in with miscellaneous
incident_codes$Incident.Category[incident_codes$Incident.Category == ''] <- 'Other Miscellaneous'
# The manual process of finding the missing categories:
old_incident_codes <- old_police_data %>%
select(Incident.Code, Category) %>%
filter()
unique(old_incident_codes$Category)
## [1] "MISSING PERSON" "WARRANTS"
## [3] "DRUG/NARCOTIC" "ASSAULT"
## [5] "LARCENY/THEFT" "SECONDARY CODES"
## [7] "FORGERY/COUNTERFEITING" "NON-CRIMINAL"
## [9] "ROBBERY" "OTHER OFFENSES"
## [11] "VEHICLE THEFT" "SUSPICIOUS OCC"
## [13] "WEAPON LAWS" "EMBEZZLEMENT"
## [15] "FRAUD" "KIDNAPPING"
## [17] "VANDALISM" "BURGLARY"
## [19] "DISORDERLY CONDUCT" "LOITERING"
## [21] "SEX OFFENSES, FORCIBLE" "DRUNKENNESS"
## [23] "TRESPASS" "BAD CHECKS"
## [25] "STOLEN PROPERTY" "PROSTITUTION"
## [27] "DRIVING UNDER THE INFLUENCE" "ARSON"
## [29] "EXTORTION" "BRIBERY"
## [31] "LIQUOR LAWS" "SUICIDE"
## [33] "GAMBLING" "PORNOGRAPHY/OBSCENE MAT"
## [35] "SEX OFFENSES, NON FORCIBLE" "RECOVERED VEHICLE"
## [37] "TREA"
is_it_in_old <- old_incident_codes %>%
filter(Incident.Code == 'insert integer value Incident.Code')
categories <- unique(incident_codes$Incident.Category)
type_old <- old_police_data %>%
select(year, IncidntNum, Incident.Code) %>%
rename(Incident.Number = IncidntNum, Incident.Year = year)
type_new <- present_police_data %>%
select(Incident.Year, Incident.Number, Incident.Code)
total_type <- bind_rows(type_old, type_new)
incidents_with_types <- left_join(total_type, incident_codes)
## Joining with `by = join_by(Incident.Code)`
# -------------------------------
# Checking the missing values:
dist_missing_types <- incident_codes %>%
filter(is.na(Incident.Category)) %>%
#group_by(Incident.Year) %>%
summarise(n())
missing_types <- incident_codes %>%
filter(is.na(Incident.Category)) %>%
group_by(Incident.Code) %>%
summarize(n())
# ------------------------------
counts_by_category <- incidents_with_types %>%
group_by(Incident.Year, Incident.Category) %>%
summarise(count_reports = as.double(n()), count_incidents = as.double(length(unique(Incident.Number)))) %>%
mutate(ratio = count_reports/count_incidents) %>%
arrange(Incident.Year, desc(count_reports))
## `summarise()` has grouped output by 'Incident.Year'. You can override using the
## `.groups` argument.
counts_by_category <- counts_by_category %>%
rename(year = Incident.Year)
year_pop_counts_categories <- full_join(sf_pop, counts_by_category, by = 'year')
counts_by_category <- year_pop_counts_categories %>%
group_by(year, Incident.Category) %>%
mutate(incidents_per_100k = 100000 * count_incidents/Population) %>%
arrange(year, desc(count_reports))
counts_by_category_drug_violations <- counts_by_category %>%
filter(Incident.Category == 'Drug Violation')
# Overall Incidents over Time
ggplot(counts_by_category_drug_violations, aes(x= year, y= count_incidents)) +
geom_point() +
geom_line() +
labs(x = 'Year', y = 'Drug Violations per Year', title = 'Drug Violations per Year Since 2003')
# Incidents Per 100k
ggplot(counts_by_category_drug_violations, aes(x= year, y= incidents_per_100k)) +
geom_point() +
geom_line() +
labs(x = 'Year', y = 'Drug Violations per Year per 100k Residents', title = 'Drug Violations per Year per 100k Residents Since 2003')
## Warning: Removed 1 rows containing missing values (`geom_point()`).
## Warning: Removed 1 row containing missing values (`geom_line()`).
# Ratio of Reports vs Actual Incidents
ggplot(counts_by_category_drug_violations, aes(x= year, y= ratio)) +
geom_point() +
geom_line() +
labs(x = 'Year', y = 'Ratio of Reports Filed v.s. Actual Incidents', title = 'The Ratio of Reported Crimes to Actual Incidents Since 2003')
counts_by_category_larceny <- counts_by_category %>%
filter(Incident.Category == 'Larceny Theft')
# Overall Incidents over Time
ggplot(counts_by_category_larceny, aes(x= year, y= count_incidents)) +
geom_point() +
geom_line() +
labs(x = 'Year', y = 'Larceny Theft Incidents per Year', title = 'Larceny Theft Incidents per Year Since 2003')
# Incidents Per 100k
ggplot(counts_by_category_larceny, aes(x= year, y= incidents_per_100k)) +
geom_point() +
geom_line() +
labs(x = 'Year', y = 'Larceny Theft Incidents per Year per 100k Residents', title = 'Larceny Theft Incidents per Year per 100k Residents Since 2003')
## Warning: Removed 1 rows containing missing values (`geom_point()`).
## Warning: Removed 1 row containing missing values (`geom_line()`).
# Ratio of Reports vs Actual Incidents
ggplot(counts_by_category_larceny, aes(x= year, y= ratio)) +
geom_point() +
geom_line() +
labs(x = 'Year', y = 'Ratio of Reports Filed v.s. Actual Incidents', title = 'The Ratio of Reported Crimes to Actual Incidents Since 2003')
counts_by_category_rape <- counts_by_category %>%
filter(Incident.Category == 'Rape')
# Overall Incidents over Time
ggplot(counts_by_category_rape, aes(x= year, y= count_incidents)) +
geom_point() +
geom_line() +
labs(x = 'Year', y = 'Rapes per Year', title = 'Rapes per Year Since 2003')
# Incidents Per 100k
ggplot(counts_by_category_rape, aes(x= year, y= incidents_per_100k)) +
geom_point() +
geom_line() +
labs(x = 'Year', y = 'Rapes per Year per 100k Residents', title = 'Rapes per Year per 100k Residents Since 2003')
## Warning: Removed 1 rows containing missing values (`geom_point()`).
## Warning: Removed 1 row containing missing values (`geom_line()`).
# Ratio of Reports vs Actual Incidents
ggplot(counts_by_category_rape, aes(x= year, y= ratio)) +
geom_point() +
geom_line() +
labs(x = 'Year', y = 'Ratio of Reports Filed v.s. Actual Incidents', title = 'The Ratio of Reported Crimes to Actual Incidents Since 2003')
counts_by_category_assault <- counts_by_category %>%
filter(Incident.Category == 'Assault')
ggplot(counts_by_category_assault, aes(x= year, y= count_incidents)) +
geom_point() +
geom_line() +
labs(x = 'Year', y = 'Assaults per Year', title = 'Assaults per Year Since 2003')
It is extremely hard to track homicide with DataSF and SFPD data. The data on homicide only goes back to 2018, not to mention and incident does not necessarily translate to a singular homicide and rather could translate to multiple homicides at one time. So, given the homicides are a rarer form of crime and the fact that one incident can actually mean multiple deaths sort of mystifies this already small set of results. Also, manslaughter is classified under the general homicide umbrella, likely necessitating a sub-category analysis on an already small category. I know accurate Homicide data exists under SFPD’s ownership, but it is not readily available to the public.
counts_by_category_homicide <- counts_by_category %>%
filter(Incident.Category == 'Homicide')
knitr::kable(counts_by_category_homicide)
| year | Population | Incident.Category | count_reports | count_incidents | ratio | incidents_per_100k |
|---|---|---|---|---|---|---|
| 2018 | 879676 | Homicide | 24 | 15 | 1.600000 | 1.705173 |
| 2019 | 878826 | Homicide | 15 | 11 | 1.363636 | 1.251670 |
| 2020 | 870393 | Homicide | 12 | 11 | 1.090909 | 1.263797 |
| 2021 | 811253 | Homicide | 14 | 10 | 1.400000 | 1.232661 |
| 2022 | 808437 | Homicide | 29 | 24 | 1.208333 | 2.968691 |
| 2023 | NA | Homicide | 14 | 10 | 1.400000 | NA |
A continuation of this analysis could use this file that comes from DataSF detailing the cities budget over the years, with a specific focus on funding of the Police.
budget <- read.csv('Budget.csv')
The trends that I observed were as follows:
The incidents of crime have gone up yearly, over the timeframe, with the exception of the pandemic. However, when we look at crime per 100K residents, crime has gone down generally, with a large increase during the pandemic.
What is especially of note is that as crime per 100k residents, the reporting of crime increased. It is difficult to comment on exactly why that is because of lack of clearance rates and the inconsistency with the “Resolution” variable in both police incident report data sets.
Another dynamic I noticed is that there was an general decrease in violent crime incidents and a general increase in Violent Crimes. My opinion is that violence is dangerous and non-violence is not dangerous. If Violent Crime is showing decrease then SF is likely becoming less dangerous.