Introduction:

I chose to focus my analysis on police incident reports from the San Francisco Police Department. My original intention was to also analyze crime clearance statistics; unfortunately, that turned out not to be possible, and I will get into that dynamic a little during my presentation. Ultimately, I chose to analyze crime incidents and reports of crime over the years.

Limitations:

There is no readily available data from before 2003, so I cannot make quantitative statements about pre-2003 crime trends from my own analysis (I would love to, based on reporting I have seen, but that would not be my own work).

There are two different data sets: one covering 2003-2017 and another covering 2018 to the present. The change in data sources brought differences in how the same variables are recorded, which makes certain forms of analysis difficult to produce.
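For example, once the two data sets are loaded below, a quick way to see the schema mismatch is to compare column names. This is only a sketch, and the exact differences depend on the extracts used, but names handled later in this report (such as IncidntNum vs. Incident.Number) show up here.

# Sketch: compare the column names of the two extracts (run after the
# data-loading code below). Mismatches such as IncidntNum vs. Incident.Number
# are reconciled later with rename().
setdiff(names(old_police_data), names(present_police_data))
setdiff(names(present_police_data), names(old_police_data))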

Necessary Packages

library(httr)
library(jsonlite)
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(readr)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ lubridate 1.9.2     ✔ tibble    3.2.1
## ✔ purrr     1.0.1     ✔ tidyr     1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter()  masks stats::filter()
## ✖ purrr::flatten() masks jsonlite::flatten()
## ✖ dplyr::lag()     masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Police Incident Data: 2003 - 2017

# Read each yearly CSV named Police_Data_YYYY.csv and stack them into one table
old_police_files  <- list.files(pattern = '^Police_Data_....\\.csv')

old_police_tables <- lapply(old_police_files, read.csv, header = TRUE)

old_police_data <- do.call(rbind, old_police_tables)

Police Incident Data: 2018 - Current

# Same process for the 2018-to-present extracts, Present_Police_Data_YYYY.csv
present_police_files  <- list.files(pattern = 'Present_Police_Data_....\\.csv')

present_police_tables <- lapply(present_police_files, read.csv, header = TRUE)

present_police_data <- do.call(rbind, present_police_tables)

SF Yearly Population

sf_pop <- read.csv('SF_POP.csv')

# SF_POP.csv stores population in thousands of residents; convert to a raw count
sf_pop$Population <- 1000 * sf_pop$Population

sf_pop <- sf_pop %>%
  separate(Year, c('year', 'month', 'date'), '-') %>%
  select(year, Population)

sf_pop$year <- as.integer(sf_pop$year)

sf_pop <- sf_pop %>%
  filter(year >= 2003)

Joining New and Old Incident Counts, Reports, and the ratio of the two:

old_police_data <- old_police_data %>%
  separate(Date, c("month", "day","year"), "/")

old_police_data$year <- as.integer(old_police_data$year)

old_counts_and_incidents <- old_police_data %>%
  group_by(year) %>%
  summarize(count_reports = as.double(n()), count_incidents = as.double(length(unique(IncidntNum))))

new_counts_and_incidents <- present_police_data %>%
  rename('year' = 'Incident.Year') %>%
  group_by(year) %>%
  summarize(count_reports = as.double(n()), count_incidents = as.double(length(unique(Incident.Number))))

counts_and_incidents <- rbind(new_counts_and_incidents, old_counts_and_incidents)

counts_and_incidents <- counts_and_incidents %>%
  group_by(year) %>%
  mutate(ratio = count_reports/count_incidents) %>%
  arrange(desc(year))

knitr::kable(counts_and_incidents)
| year| count_reports| count_incidents|    ratio|
|----:|-------------:|---------------:|--------:|
| 2023|         44748|           32720| 1.367604|
| 2022|        134843|           98483| 1.369201|
| 2021|        128775|           93640| 1.375214|
| 2020|        118313|           84351| 1.402627|
| 2019|        148272|          108330| 1.368707|
| 2018|        152707|          111528| 1.369226|
| 2017|        149487|          119633| 1.249546|
| 2016|        145994|          115390| 1.265222|
| 2015|        151459|          120623| 1.255640|
| 2014|        144844|          114664| 1.263204|
| 2013|        147664|          113518| 1.300798|
| 2012|        135464|          106708| 1.269483|
| 2011|        126713|           98883| 1.281444|
| 2010|        127758|           99311| 1.286444|
| 2009|        134309|          105833| 1.269065|
| 2008|        135242|          111653| 1.211271|
| 2007|        131771|          109297| 1.205623|
| 2006|        131856|          112825| 1.168677|
| 2005|        137048|          112695| 1.216096|
| 2004|        142054|          113110| 1.255893|
| 2003|        142803|          114538| 1.246774|

Visualizations: Incidents Over Time

# I did not include 2023 just yet, because the year is only about 1/3 of the 
# way over and would not contribute to reliable comparisons for raw counts

counts_and_incidents_no_2023 <- subset(counts_and_incidents, year != 2023)

ggplot(counts_and_incidents_no_2023, aes(x= year, y= count_incidents)) +
  geom_point() +
  geom_line() +
  labs(x = 'Year', y = 'Incidents per Year', title = 'Incidents per Year Since 2003')

Visualization: Ratio of Reports vs. Actual Incidents

ggplot(counts_and_incidents, aes(x= year, y= ratio)) +
  geom_point() +
  geom_line() +
  labs(x = 'Year', y = 'Ratio of Reports Filed vs. Actual Incidents', title = 'The Ratio of Reported Crimes to Actual Incidents Since 2003')

Adding population to counts and incidents

years_and_counts <- full_join(sf_pop, counts_and_incidents, by = 'year')

Crime per 100K

years_and_counts <- years_and_counts %>%
  mutate(reports_per_100k = 100000 * count_reports/Population, incidents_per_100k = 100000 * count_incidents/Population) %>%
  arrange(desc(year))

knitr::kable(years_and_counts)
| year| Population| count_reports| count_incidents|    ratio| reports_per_100k| incidents_per_100k|
|----:|----------:|-------------:|---------------:|--------:|----------------:|------------------:|
| 2023|         NA|         44748|           32720| 1.367604|               NA|                 NA|
| 2022|     808437|        134843|           98483| 1.369201|         16679.47|           12181.90|
| 2021|     811253|        128775|           93640| 1.375214|         15873.59|           11542.64|
| 2020|     870393|        118313|           84351| 1.402627|         13593.06|            9691.14|
| 2019|     878826|        148272|          108330| 1.368707|         16871.60|           12326.67|
| 2018|     879676|        152707|          111528| 1.369226|         17359.46|           12678.30|
| 2017|     877471|        149487|          119633| 1.249546|         17036.12|           13633.84|
| 2016|     871343|        145994|          115390| 1.265222|         16755.06|           13242.78|
| 2015|     863237|        151459|          120623| 1.255640|         17545.47|           13973.34|
| 2014|     850918|        144844|          114664| 1.263204|         17022.09|           13475.33|
| 2013|     839695|        147664|          113518| 1.300798|         17585.43|           13518.96|
| 2012|     828963|        135464|          106708| 1.269483|         16341.38|           12872.47|
| 2011|     815694|        126713|           98883| 1.281444|         15534.38|           12122.56|
| 2010|     805519|        127758|           99311| 1.286444|         15860.33|           12328.82|
| 2009|     815184|        134309|          105833| 1.269065|         16475.91|           12982.71|
| 2008|     807904|        135242|          111653| 1.211271|         16739.86|           13820.08|
| 2007|     799185|        131771|          109297| 1.205623|         16488.17|           13676.06|
| 2006|     786187|        131856|          112825| 1.168677|         16771.58|           14350.91|
| 2005|     777835|        137048|          112695| 1.216096|         17619.16|           14488.29|
| 2004|     773556|        142054|          113110| 1.255893|         18363.76|           14622.08|
| 2003|     775663|        142803|          114538| 1.246774|         18410.44|           14766.46|

Visualization: Crime Incidents per 100k

ggplot(years_and_counts, aes(x= year, y= incidents_per_100k)) +
  geom_point() +
  geom_line() +
  labs(x = 'Year', y = 'Incidents per Year per 100k Residents', title = 'Incidents per Year per 100k Residents Since 2003')
## Warning: Removed 1 rows containing missing values (`geom_point()`).
## Warning: Removed 1 row containing missing values (`geom_line()`).

Arrest Rate

An over-time comparison of arrest rates is not possible, and that has to do with how SFPD records “Resolutions” for incidents. There is no easily accessible documentation of what each resolution type meant in the older data (I could not find any after scouring SFPD's website and SF City and County documentation), and the classifications were then changed with no explanation. Given this lack of transparency, it is difficult to track crime clearance rates over time. The data does appear to exist internally, but it has not been released publicly.

new_outcomes <- present_police_data %>%
  group_by(Resolution) %>%
  summarize(counts = n())

unique(new_outcomes$Resolution)
## [1] "Cite or Arrest Adult" "Exceptional Adult"    "Open or Active"      
## [4] "Unfounded"
old_outcomes <- old_police_data %>%
  group_by(Resolution) %>%
  summarize(count = n())

unique(tolower(old_outcomes$Resolution))
##  [1] "arrest, booked"                        
##  [2] "arrest, cited"                         
##  [3] "complainant refuses to prosecute"      
##  [4] "district attorney refuses to prosecute"
##  [5] "exceptional clearance"                 
##  [6] "located"                               
##  [7] "none"                                  
##  [8] "not prosecuted"                        
##  [9] "prosecuted by outside agency"          
## [10] "prosecuted for lesser offense"         
## [11] "psychopathic case"                     
## [12] "unfounded"
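Within the 2018-to-present data alone, a rough “cited or arrested” share can still be sketched from the Resolution values listed above. This is illustrative only and, for the reasons described, not comparable with the 2003-2017 data.

# Sketch: yearly share of reports in the 2018-present data whose Resolution is
# "Cite or Arrest Adult". Not comparable to the older data because the
# resolution categories changed without documentation.
arrest_share_new <- present_police_data %>%
  rename(year = Incident.Year) %>%
  group_by(year) %>%
  summarize(share_cite_or_arrest = mean(Resolution == 'Cite or Arrest Adult'))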

Incident Code Crosswalk:

I am going to have to do some manual work to build a full crosswalk, because using just the DataSF crosswalk leaves me with about 6800 observations without a crime category. Not to mention, the missing crime categories occur more often in the earlier years, which would slightly skew results in the direction of the “dangerous SF” narrative.

# This incident code crosswalk comes from DataSF, and using it leads to missing
# crime categories. Essentially, the crosswalk is not sufficient and needs to be
# improved; unfortunately, that process is largely manual.

#incident_codes <- jsonlite::fromJSON('https://data.sfgov.org/resource/ci9u-8awy.json?$limit=1012')

#incident_codes$inc_code <- as.integer(incident_codes$inc_code)

#incident_codes <- incident_codes %>%
#  rename(Incident.Code = inc_code, Category = category, Subcategory = subcategory)

incident_codes <- present_police_data %>%
  select(Incident.Code, Incident.Category) %>%
  unique()

# I could continue this process, but this gets the number of observations missing
# a category down to a little less than 800, which is significantly better than the
# 6793 observations missing a category when you use DataSF's own crosswalk.
# Truthfully, this is a bit of a tedious process.

row1 = c(Incident.Code = 11010, Incident.Category = 'Stolen Property')
row2 = c(Incident.Code = 28090, Incident.Category = 'Vandalism')
row3 = c(Incident.Code = 30090, Incident.Category = 'Other Miscellaneous')
row4 = c(Incident.Code = 0, Incident.Category = 'Other Miscellaneous')
row5 = c(Incident.Code = 6311, Incident.Category = 'Larceny Theft')
row6 = c(Incident.Code = 16644, Incident.Category = 'Drug Violation')

incident_codes <- rbind(incident_codes, row1, row2, row3, row4, row5, row6)

incident_codes$Incident.Code <- as.integer(incident_codes$Incident.Code)

# Combined Drug Offense and Drug Violation into one category

incident_codes$Incident.Category[incident_codes$Incident.Category == 'Drug Offense'] <- 'Drug Violation'

# lumped the blank category in with miscellaneous

incident_codes$Incident.Category[incident_codes$Incident.Category == ''] <- 'Other Miscellaneous'

# The manual process of finding the missing categories:

old_incident_codes <- old_police_data %>%
  select(Incident.Code, Category)

unique(old_incident_codes$Category)
##  [1] "MISSING PERSON"              "WARRANTS"                   
##  [3] "DRUG/NARCOTIC"               "ASSAULT"                    
##  [5] "LARCENY/THEFT"               "SECONDARY CODES"            
##  [7] "FORGERY/COUNTERFEITING"      "NON-CRIMINAL"               
##  [9] "ROBBERY"                     "OTHER OFFENSES"             
## [11] "VEHICLE THEFT"               "SUSPICIOUS OCC"             
## [13] "WEAPON LAWS"                 "EMBEZZLEMENT"               
## [15] "FRAUD"                       "KIDNAPPING"                 
## [17] "VANDALISM"                   "BURGLARY"                   
## [19] "DISORDERLY CONDUCT"          "LOITERING"                  
## [21] "SEX OFFENSES, FORCIBLE"      "DRUNKENNESS"                
## [23] "TRESPASS"                    "BAD CHECKS"                 
## [25] "STOLEN PROPERTY"             "PROSTITUTION"               
## [27] "DRIVING UNDER THE INFLUENCE" "ARSON"                      
## [29] "EXTORTION"                   "BRIBERY"                    
## [31] "LIQUOR LAWS"                 "SUICIDE"                    
## [33] "GAMBLING"                    "PORNOGRAPHY/OBSCENE MAT"    
## [35] "SEX OFFENSES, NON FORCIBLE"  "RECOVERED VEHICLE"          
## [37] "TREA"
# Lookup template: replace the placeholder string with an integer Incident.Code
# to check whether, and how, that code is categorized in the old data.
is_it_in_old <- old_incident_codes %>%
  filter(Incident.Code == 'insert integer value Incident.Code')

categories <- unique(incident_codes$Incident.Category)
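Below is a minimal sketch of how the remaining gaps can be listed before looking each code up with the template above (assuming the Incident.Code columns in the two tables are of compatible types):

# Sketch: list old incident codes that do not appear in the crosswalk, so each
# one can be looked up manually with the template above.
uncategorized_codes <- old_incident_codes %>%
  distinct(Incident.Code, Category) %>%
  anti_join(incident_codes, by = 'Incident.Code') %>%
  arrange(Incident.Code)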

Analysis of Crime Type:

type_old <- old_police_data %>%
  select(year, IncidntNum, Incident.Code) %>%
  rename(Incident.Number = IncidntNum, Incident.Year = year)

type_new <- present_police_data %>%
  select(Incident.Year, Incident.Number, Incident.Code)

total_type <- bind_rows(type_old, type_new)

incidents_with_types <- left_join(total_type, incident_codes)
## Joining with `by = join_by(Incident.Code)`
# -------------------------------
# Checking the missing values:

# Count the joined observations that are still missing a category
dist_missing_types <- incidents_with_types %>%
  filter(is.na(Incident.Category)) %>%
  #group_by(Incident.Year) %>%
  summarise(n())

# Which incident codes those uncategorized observations carry
missing_types <- incidents_with_types %>%
  filter(is.na(Incident.Category)) %>%
  group_by(Incident.Code) %>%
  summarize(n())
# ------------------------------

counts_by_category <- incidents_with_types %>%
  group_by(Incident.Year, Incident.Category) %>%
  summarise(count_reports = as.double(n()), count_incidents = as.double(length(unique(Incident.Number)))) %>%
  mutate(ratio = count_reports/count_incidents) %>%
  arrange(Incident.Year, desc(count_reports))
## `summarise()` has grouped output by 'Incident.Year'. You can override using the
## `.groups` argument.
counts_by_category <- counts_by_category %>%
  rename(year = Incident.Year)

year_pop_counts_categories <- full_join(sf_pop, counts_by_category, by = 'year')

counts_by_category <- year_pop_counts_categories %>%
  group_by(year, Incident.Category) %>%
  mutate(incidents_per_100k = 100000 * count_incidents/Population) %>%
  arrange(year, desc(count_reports))

Visualization: Drug Violations

counts_by_category_drug_violations <- counts_by_category %>%
  filter(Incident.Category == 'Drug Violation')

# Overall Incidents over Time

ggplot(counts_by_category_drug_violations, aes(x= year, y= count_incidents)) +
  geom_point() +
  geom_line() +
  labs(x = 'Year', y = 'Drug Violations per Year', title = 'Drug Violations per Year Since 2003')

# Incidents Per 100k

ggplot(counts_by_category_drug_violations, aes(x= year, y= incidents_per_100k)) +
  geom_point() +
  geom_line() +
  labs(x = 'Year', y = 'Drug Violations per Year per 100k Residents', title = 'Drug Violations per Year per 100k Residents Since 2003')
## Warning: Removed 1 rows containing missing values (`geom_point()`).
## Warning: Removed 1 row containing missing values (`geom_line()`).

# Ratio of Reports vs Actual Incidents

ggplot(counts_by_category_drug_violations, aes(x= year, y= ratio)) +
  geom_point() +
  geom_line() +
  labs(x = 'Year', y = 'Ratio of Reports Filed vs. Actual Incidents', title = 'The Ratio of Reported Crimes to Actual Incidents Since 2003')

Visualization: Larceny Theft

counts_by_category_larceny <- counts_by_category %>%
  filter(Incident.Category == 'Larceny Theft')

# Overall Incidents over Time

ggplot(counts_by_category_larceny, aes(x= year, y= count_incidents)) +
  geom_point() +
  geom_line() +
  labs(x = 'Year', y = 'Larceny Theft Incidents per Year', title = 'Larceny Theft Incidents per Year Since 2003')

# Incidents Per 100k

ggplot(counts_by_category_larceny, aes(x= year, y= incidents_per_100k)) +
  geom_point() +
  geom_line() +
  labs(x = 'Year', y = 'Larceny Theft Incidents per Year per 100k Residents', title = 'Larceny Theft Incidents per Year per 100k Residents Since 2003')
## Warning: Removed 1 rows containing missing values (`geom_point()`).
## Warning: Removed 1 row containing missing values (`geom_line()`).

# Ratio of Reports vs Actual Incidents

ggplot(counts_by_category_larceny, aes(x= year, y= ratio)) +
  geom_point() +
  geom_line() +
  labs(x = 'Year', y = 'Ratio of Reports Filed vs. Actual Incidents', title = 'The Ratio of Reported Crimes to Actual Incidents Since 2003')

Visualization: Rape

counts_by_category_rape <- counts_by_category %>%
  filter(Incident.Category == 'Rape')

# Overall Incidents over Time

ggplot(counts_by_category_rape, aes(x= year, y= count_incidents)) +
  geom_point() +
  geom_line() +
  labs(x = 'Year', y = 'Rapes per Year', title = 'Rapes per Year Since 2003')

# Incidents Per 100k

ggplot(counts_by_category_rape, aes(x= year, y= incidents_per_100k)) +
  geom_point() +
  geom_line() +
  labs(x = 'Year', y = 'Rapes per Year per 100k Residents', title = 'Rapes per Year per 100k Residents Since 2003')
## Warning: Removed 1 rows containing missing values (`geom_point()`).
## Warning: Removed 1 row containing missing values (`geom_line()`).

# Ratio of Reports vs Actual Incidents

ggplot(counts_by_category_rape, aes(x= year, y= ratio)) +
  geom_point() +
  geom_line() +
  labs(x = 'Year', y = 'Ratio of Reports Filed vs. Actual Incidents', title = 'The Ratio of Reported Crimes to Actual Incidents Since 2003')

Visualization: Assault

counts_by_category_assault <- counts_by_category %>%
  filter(Incident.Category == 'Assault')

ggplot(counts_by_category_assault, aes(x= year, y= count_incidents)) +
  geom_point() +
  geom_line() +
  labs(x = 'Year', y = 'Assaults per Year', title = 'Assaults per Year Since 2003')

Homicide

It is extremely hard to track homicide with DataSF and SFPD data. The homicide data only goes back to 2018, and an incident does not necessarily translate to a single homicide; one incident can represent multiple homicides at once. Given that homicide is a rarer form of crime, the possibility of multiple deaths per incident further muddies an already small set of results. Also, manslaughter is classified under the general homicide umbrella, likely necessitating a sub-category analysis on an already small category. Accurate homicide data does exist under SFPD's ownership, but it is not readily available to the public.

counts_by_category_homicide <- counts_by_category %>%
  filter(Incident.Category == 'Homicide')

knitr::kable(counts_by_category_homicide)
| year| Population|Incident.Category | count_reports| count_incidents|    ratio| incidents_per_100k|
|----:|----------:|:-----------------|-------------:|---------------:|--------:|------------------:|
| 2018|     879676|Homicide          |            24|              15| 1.600000|           1.705173|
| 2019|     878826|Homicide          |            15|              11| 1.363636|           1.251670|
| 2020|     870393|Homicide          |            12|              11| 1.090909|           1.263797|
| 2021|     811253|Homicide          |            14|              10| 1.400000|           1.232661|
| 2022|     808437|Homicide          |            29|              24| 1.208333|           2.968691|
| 2023|         NA|Homicide          |            14|              10| 1.400000|                 NA|
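As a starting point for that sub-category breakdown, here is a minimal sketch. It assumes the 2018-to-present extract includes an Incident.Subcategory column, which is not otherwise used in this report.

# Sketch: break the small Homicide category down by subcategory (e.g. to
# separate manslaughter) in the 2018-present data. Assumes the extract
# contains an Incident.Subcategory column.
homicide_by_subcategory <- present_police_data %>%
  filter(Incident.Category == 'Homicide') %>%
  group_by(Incident.Year, Incident.Subcategory) %>%
  summarize(count_reports = n(),
            count_incidents = length(unique(Incident.Number)),
            .groups = 'drop')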

SF City Budget

A continuation of this analysis could use this file, which comes from DataSF and details the city's budget over the years, with a specific focus on police funding.

budget <- read.csv('Budget.csv')
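As a hedged sketch of that continuation: the column names department, fiscal_year, and budget below are assumptions, not the actual columns in Budget.csv, and should be replaced after checking names(budget).

# Sketch: total police-related budget by year. The columns department,
# fiscal_year, and budget are assumed names; substitute the real ones from
# Budget.csv before running.
police_budget <- budget %>%
  filter(grepl('police', department, ignore.case = TRUE)) %>%
  group_by(fiscal_year) %>%
  summarize(total_budget = sum(budget, na.rm = TRUE))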

Conclusion:

The trends that I observed were as follows:

Incidents of crime went up yearly over the timeframe, with the exception of the pandemic. However, when we look at crime per 100k residents, crime has generally gone down, with a large increase during the pandemic.

What is especially notable is that even as crime per 100k residents fell, the ratio of reports filed to actual incidents increased. It is difficult to say exactly why, given the lack of clearance rates and the inconsistency of the “Resolution” variable across the two police incident data sets.

Another dynamic I noticed is a general decrease in violent crime incidents alongside a general increase in non-violent crime incidents. My opinion is that violent crime is what makes a city dangerous, and non-violent crime much less so. If violent crime is decreasing, then SF is likely becoming less dangerous.