Data 607 Final Project

Introduction:

I chose to have the focus of my analysis be police incident reports from the City of San Francisco Police Department. My intention was to include analysis of crime clearance statistics, and unfortunately that is not possible and I will get into that dynamic a little during my presentation. Ultimately, I chose to conduct an analysis over the years of crime and reports of crime.

Limitations:

There is no readily available data for before 2003, so I cannot make quantitative statements from my own analysis of crime trends from before 2003 (I would love to based on reporting I have seen; however that would not be my own work).

There are two different data sets; one from 2003-2018 and another from 2018-current. From this change in data sources has come a difference in values for the same variables (this makes certain forms of analysis difficult to produce).

Necessary Packages

library(httr)
library(jsonlite)
library(ggplot2)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(readr)
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ lubridate 1.9.2     ✔ tibble    3.2.1
## ✔ purrr     1.0.1     ✔ tidyr     1.3.0

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter()  masks stats::filter()
## ✖ purrr::flatten() masks jsonlite::flatten()
## ✖ dplyr::lag()     masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Police Incident Data: 2003 - 2017

old_police_files  <- list.files(pattern = '^Police_Data_....\\.csv')

old_police_tables <- lapply(old_police_files, read.csv, header = TRUE)

old_police_data <- do.call(rbind , old_police_tables)

Police Incident Data: 2018 - Current

present_police_files  <- list.files(pattern = 'Present_Police_Data_....\\.csv')

present_police_tables <- lapply(present_police_files, read.csv, header = TRUE)

present_police_data <- do.call(rbind , present_police_tables)

SF Yearly Population

sf_pop <- read.csv('SF_POP.csv')

sf_pop$Population <- 1000 * sf_pop$Population

sf_pop <- sf_pop %>%
  separate(Year, c('year', 'month', 'date'), '-') %>%
  select(year, Population)

sf_pop$year <- as.integer(sf_pop$year)

sf_pop <- sf_pop %>%
  filter(year >= 2003)

Joining New and Old Incident Counts, Reports, and the ratio of the two:

old_police_data <- old_police_data %>%
  separate(Date, c("month", "day","year"), "/")

old_police_data$year <- as.integer(old_police_data$year)

old_counts_and_incidents <- old_police_data %>%
  group_by(year) %>%
  summarize(count_reports = as.double(n()), count_incidents = as.double(length(unique(IncidntNum))))

new_counts_and_incidents <- present_police_data %>%
  rename('year' = 'Incident.Year') %>%
  group_by(year) %>%
  summarize(count_reports = as.double(n()), count_incidents = as.double(length(unique(Incident.Number))))

counts_and_incidents <- rbind(new_counts_and_incidents, old_counts_and_incidents)

counts_and_incidents <- counts_and_incidents %>%
  group_by(year) %>%
  mutate(ratio = count_reports/count_incidents) %>%
  arrange(desc(year))

knitr::kable(counts_and_incidents)

year	count_reports	count_incidents	ratio
2023	44748	32720	1.367604
2022	134843	98483	1.369201
2021	128775	93640	1.375214
2020	118313	84351	1.402627
2019	148272	108330	1.368707
2018	152707	111528	1.369226
2017	149487	119633	1.249546
2016	145994	115390	1.265222
2015	151459	120623	1.255640
2014	144844	114664	1.263204
2013	147664	113518	1.300798
2012	135464	106708	1.269483
2011	126713	98883	1.281444
2010	127758	99311	1.286444
2009	134309	105833	1.269065
2008	135242	111653	1.211271
2007	131771	109297	1.205623
2006	131856	112825	1.168677
2005	137048	112695	1.216096
2004	142054	113110	1.255893
2003	142803	114538	1.246774

Visualizations: Incidents Over Time

# I did not include 2023 just yet, because the year is only a about 1/3 of the 
# way over and would not contribute to reliable comparisions for raw counts

counts_and_incidents_no_2023 <- subset(counts_and_incidents, year != 2023)

ggplot(counts_and_incidents_no_2023, aes(x= year, y= count_incidents)) +
  geom_point() +
  geom_line() +
  labs(x = 'Year', y = 'Incidents per Year', title = 'Incidents per Year Since 2003')

Visualization: Ratio of reports vs acutal incidents

ggplot(counts_and_incidents, aes(x= year, y= ratio)) +
  geom_point() +
  geom_line() +
  labs(x = 'Year', y = 'Ratio of Reports Filed v.s. Actual Incidents', title = 'The Ratio of Reported Crimes to Actual Incidents Since 2003')

Adding population to counts and incidents

years_and_counts <- full_join(sf_pop, counts_and_incidents, by = 'year')

Crime per 100K

years_and_counts <- years_and_counts %>%
  mutate(reports_per_100k = 100000 * count_reports/Population, incidents_per_100k = 100000 * count_incidents/Population) %>%
  arrange(desc(year))

knitr::kable(years_and_counts)

year	Population	count_reports	count_incidents	ratio	reports_per_100k	incidents_per_100k
2023	NA	44748	32720	1.367604	NA	NA
2022	808437	134843	98483	1.369201	16679.47	12181.90
2021	811253	128775	93640	1.375214	15873.59	11542.64
2020	870393	118313	84351	1.402627	13593.06	9691.14
2019	878826	148272	108330	1.368707	16871.60	12326.67
2018	879676	152707	111528	1.369226	17359.46	12678.30
2017	877471	149487	119633	1.249546	17036.12	13633.84
2016	871343	145994	115390	1.265222	16755.06	13242.78
2015	863237	151459	120623	1.255640	17545.47	13973.34
2014	850918	144844	114664	1.263204	17022.09	13475.33
2013	839695	147664	113518	1.300798	17585.43	13518.96
2012	828963	135464	106708	1.269483	16341.38	12872.47
2011	815694	126713	98883	1.281444	15534.38	12122.56
2010	805519	127758	99311	1.286444	15860.33	12328.82
2009	815184	134309	105833	1.269065	16475.91	12982.71
2008	807904	135242	111653	1.211271	16739.86	13820.08
2007	799185	131771	109297	1.205623	16488.17	13676.06
2006	786187	131856	112825	1.168677	16771.58	14350.91
2005	777835	137048	112695	1.216096	17619.16	14488.29
2004	773556	142054	113110	1.255893	18363.76	14622.08
2003	775663	142803	114538	1.246774	18410.44	14766.46

Visualization: Crime Incidents per 100k

ggplot(years_and_counts, aes(x= year, y= incidents_per_100k)) +
  geom_point() +
  geom_line() +
  labs(x = 'Year', y = 'Incidents per Year per 100k Residents', title = 'Incidents per Year per 100k Residents Since 2003')

## Warning: Removed 1 rows containing missing values (`geom_point()`).

## Warning: Removed 1 row containing missing values (`geom_line()`).

Arrest Rate Rate

This is not possible for over time comparison and that has to do with how SF records “Resolutions” to incidents. There is no easily accessible (I could not find any after scouring SFPD’s website and SF City and County Documentation) documentation on what each resolution type meant for the older data, and then the classifications were changed with no explanation. Due to this lack of transparency, it is difficult to track crime clearance rates over time. It does seem that this is data that does exist internally and it is not open sourced.

new_outcomes <- present_police_data %>%
  group_by(Resolution) %>%
  summarize(counts = n())

unique(new_outcomes$Resolution)

## [1] "Cite or Arrest Adult" "Exceptional Adult"    "Open or Active"      
## [4] "Unfounded"

old_outcomes <- old_police_data %>%
  group_by(Resolution) %>%
  summarize(count = n())

unique(tolower(old_outcomes$Resolution))

##  [1] "arrest, booked"                        
##  [2] "arrest, cited"                         
##  [3] "complainant refuses to prosecute"      
##  [4] "district attorney refuses to prosecute"
##  [5] "exceptional clearance"                 
##  [6] "located"                               
##  [7] "none"                                  
##  [8] "not prosecuted"                        
##  [9] "prosecuted by outside agency"          
## [10] "prosecuted for lesser offense"         
## [11] "psychopathic case"                     
## [12] "unfounded"

Incident Code Crosswalk:

I am going to have to do some manual work to make a full cross walk because if just use the DataSF crosswalk that leaves me with about 6800 obersvations without a crime category. Not to mention, the missing crime categories occur more often in the earlier years which would slightly skew results in the direction of the dangerous SF narrative.

# This incident codes crosswalk comes from DataSF and using it leads to missing 
# crime categories. Essentially, this crosswalk is not sufficient and needs to be 
# improved and the process will unfortunately be largely manual.

#incident_codes <- jsonlite::fromJSON('https://data.sfgov.org/resource/ci9u-8awy.json?$limit=1012')

#incident_codes$inc_code <- as.integer(incident_codes$inc_code)

#incident_codes <- incident_codes %>%
#  rename(Incident.Code = inc_code, Category = category, Subcategory = subcategory)

incident_codes <- present_police_data %>%
  select(Incident.Code, Incident.Category) %>%
  unique()

# I could continue this process, but this gets the number of observations missing
# a category down to a little less the 800, which is significantly better than the 
# 6793 observations missing a category when you use DataSF's proprietary crosswalk
# and truthfully this is a bit of tedious process.

row1 = c(Incident.Code = 11010, Incident.Category = 'Stolen Property')
row2 = c(Incident.Code = 28090, Incident.Category = 'Vandalism')
row3 = c(Incident.Code = 30090, Incident.Category = 'Other Miscellaneous')
row4 = c(Incident.Code = 0, Incident.Category = 'Other Miscellaneous')
row5 = c(Incident.Code = 6311, Incident.Category = 'Larceny Theft')
row6 = c(Incident.Code = 16644, Incident.Category = 'Drug Violation')

incident_codes <- rbind(incident_codes, row1, row2, row3, row4, row5, row6)

incident_codes$Incident.Code <- as.integer(incident_codes$Incident.Code)

# Combined Drug Offense and Drug Violation into one category

incident_codes$Incident.Category[incident_codes$Incident.Category == 'Drug Offense'] <- 'Drug Violation'

# lumped the blank category in with miscellaneous

incident_codes$Incident.Category[incident_codes$Incident.Category == ''] <- 'Other Miscellaneous'

# The manual process of finding the missing categories:

old_incident_codes <- old_police_data %>%
  select(Incident.Code, Category) %>%
  filter()

unique(old_incident_codes$Category)

##  [1] "MISSING PERSON"              "WARRANTS"                   
##  [3] "DRUG/NARCOTIC"               "ASSAULT"                    
##  [5] "LARCENY/THEFT"               "SECONDARY CODES"            
##  [7] "FORGERY/COUNTERFEITING"      "NON-CRIMINAL"               
##  [9] "ROBBERY"                     "OTHER OFFENSES"             
## [11] "VEHICLE THEFT"               "SUSPICIOUS OCC"             
## [13] "WEAPON LAWS"                 "EMBEZZLEMENT"               
## [15] "FRAUD"                       "KIDNAPPING"                 
## [17] "VANDALISM"                   "BURGLARY"                   
## [19] "DISORDERLY CONDUCT"          "LOITERING"                  
## [21] "SEX OFFENSES, FORCIBLE"      "DRUNKENNESS"                
## [23] "TRESPASS"                    "BAD CHECKS"                 
## [25] "STOLEN PROPERTY"             "PROSTITUTION"               
## [27] "DRIVING UNDER THE INFLUENCE" "ARSON"                      
## [29] "EXTORTION"                   "BRIBERY"                    
## [31] "LIQUOR LAWS"                 "SUICIDE"                    
## [33] "GAMBLING"                    "PORNOGRAPHY/OBSCENE MAT"    
## [35] "SEX OFFENSES, NON FORCIBLE"  "RECOVERED VEHICLE"          
## [37] "TREA"

is_it_in_old <- old_incident_codes %>%
  filter(Incident.Code == 'insert integer value Incident.Code')

categories <- unique(incident_codes$Incident.Category)

Analysis of Crime Type:

type_old <- old_police_data %>%
  select(year, IncidntNum, Incident.Code) %>%
  rename(Incident.Number = IncidntNum, Incident.Year = year)

type_new <- present_police_data %>%
  select(Incident.Year, Incident.Number, Incident.Code)

total_type <- bind_rows(type_old, type_new)

incidents_with_types <- left_join(total_type, incident_codes)

## Joining with `by = join_by(Incident.Code)`

# -------------------------------
# Checking the missing values:

dist_missing_types <- incident_codes %>%
  filter(is.na(Incident.Category)) %>%
  #group_by(Incident.Year) %>%
  summarise(n())

missing_types <- incident_codes %>%
  filter(is.na(Incident.Category)) %>%
  group_by(Incident.Code) %>%
  summarize(n())
# ------------------------------

counts_by_category <- incidents_with_types %>%
  group_by(Incident.Year, Incident.Category) %>%
  summarise(count_reports = as.double(n()), count_incidents = as.double(length(unique(Incident.Number)))) %>%
  mutate(ratio = count_reports/count_incidents) %>%
  arrange(Incident.Year, desc(count_reports))

## `summarise()` has grouped output by 'Incident.Year'. You can override using the
## `.groups` argument.

counts_by_category <- counts_by_category %>%
  rename(year = Incident.Year)

year_pop_counts_categories <- full_join(sf_pop, counts_by_category, by = 'year')

counts_by_category <- year_pop_counts_categories %>%
  group_by(year, Incident.Category) %>%
  mutate(incidents_per_100k = 100000 * count_incidents/Population) %>%
  arrange(year, desc(count_reports))

Visualization: Drug Violations

counts_by_category_drug_violations <- counts_by_category %>%
  filter(Incident.Category == 'Drug Violation')

# Overall Incidents over Time

ggplot(counts_by_category_drug_violations, aes(x= year, y= count_incidents)) +
  geom_point() +
  geom_line() +
  labs(x = 'Year', y = 'Drug Violations per Year', title = 'Drug Violations per Year Since 2003')

# Incidents Per 100k

ggplot(counts_by_category_drug_violations, aes(x= year, y= incidents_per_100k)) +
  geom_point() +
  geom_line() +
  labs(x = 'Year', y = 'Drug Violations per Year per 100k Residents', title = 'Drug Violations per Year per 100k Residents Since 2003')

## Warning: Removed 1 rows containing missing values (`geom_point()`).

## Warning: Removed 1 row containing missing values (`geom_line()`).

# Ratio of Reports vs Actual Incidents

ggplot(counts_by_category_drug_violations, aes(x= year, y= ratio)) +
  geom_point() +
  geom_line() +
  labs(x = 'Year', y = 'Ratio of Reports Filed v.s. Actual Incidents', title = 'The Ratio of Reported Crimes to Actual Incidents Since 2003')

Visualization: Larceny Theft

counts_by_category_larceny <- counts_by_category %>%
  filter(Incident.Category == 'Larceny Theft')

# Overall Incidents over Time

ggplot(counts_by_category_larceny, aes(x= year, y= count_incidents)) +
  geom_point() +
  geom_line() +
  labs(x = 'Year', y = 'Larceny Theft Incidents per Year', title = 'Larceny Theft Incidents per Year Since 2003')

# Incidents Per 100k

ggplot(counts_by_category_larceny, aes(x= year, y= incidents_per_100k)) +
  geom_point() +
  geom_line() +
  labs(x = 'Year', y = 'Larceny Theft Incidents per Year per 100k Residents', title = 'Larceny Theft Incidents per Year per 100k Residents Since 2003')

## Warning: Removed 1 rows containing missing values (`geom_point()`).

## Warning: Removed 1 row containing missing values (`geom_line()`).

# Ratio of Reports vs Actual Incidents

ggplot(counts_by_category_larceny, aes(x= year, y= ratio)) +
  geom_point() +
  geom_line() +
  labs(x = 'Year', y = 'Ratio of Reports Filed v.s. Actual Incidents', title = 'The Ratio of Reported Crimes to Actual Incidents Since 2003')

Visualization: Rape

counts_by_category_rape <- counts_by_category %>%
  filter(Incident.Category == 'Rape')

# Overall Incidents over Time

ggplot(counts_by_category_rape, aes(x= year, y= count_incidents)) +
  geom_point() +
  geom_line() +
  labs(x = 'Year', y = 'Rapes per Year', title = 'Rapes per Year Since 2003')

# Incidents Per 100k

ggplot(counts_by_category_rape, aes(x= year, y= incidents_per_100k)) +
  geom_point() +
  geom_line() +
  labs(x = 'Year', y = 'Rapes per Year per 100k Residents', title = 'Rapes per Year per 100k Residents Since 2003')

## Warning: Removed 1 rows containing missing values (`geom_point()`).

## Warning: Removed 1 row containing missing values (`geom_line()`).

# Ratio of Reports vs Actual Incidents

ggplot(counts_by_category_rape, aes(x= year, y= ratio)) +
  geom_point() +
  geom_line() +
  labs(x = 'Year', y = 'Ratio of Reports Filed v.s. Actual Incidents', title = 'The Ratio of Reported Crimes to Actual Incidents Since 2003')

Assault

counts_by_category_assault <- counts_by_category %>%
  filter(Incident.Category == 'Assault')

ggplot(counts_by_category_assault, aes(x= year, y= count_incidents)) +
  geom_point() +
  geom_line() +
  labs(x = 'Year', y = 'Assaults per Year', title = 'Assaults per Year Since 2003')

Homicide

It is extremely hard to track homicide with DataSF and SFPD data. The data on homicide only goes back to 2018, not to mention and incident does not necessarily translate to a singular homicide and rather could translate to multiple homicides at one time. So, given the homicides are a rarer form of crime and the fact that one incident can actually mean multiple deaths sort of mystifies this already small set of results. Also, manslaughter is classified under the general homicide umbrella, likely necessitating a sub-category analysis on an already small category. I know accurate Homicide data exists under SFPD’s ownership, but it is not readily available to the public.

counts_by_category_homicide <- counts_by_category %>%
  filter(Incident.Category == 'Homicide')

knitr::kable(counts_by_category_homicide)

year	Population	Incident.Category	count_reports	count_incidents	ratio	incidents_per_100k
2018	879676	Homicide	24	15	1.600000	1.705173
2019	878826	Homicide	15	11	1.363636	1.251670
2020	870393	Homicide	12	11	1.090909	1.263797
2021	811253	Homicide	14	10	1.400000	1.232661
2022	808437	Homicide	29	24	1.208333	2.968691
2023	NA	Homicide	14	10	1.400000	NA

SF City Budget

A continuation of this analysis could use this file that comes from DataSF detailing the cities budget over the years, with a specific focus on funding of the Police.

budget <- read.csv('Budget.csv')

Conclusion:

The trends that I observed were as follows:

The incidents of crime have gone up yearly, over the timeframe, with the exception of the pandemic. However, when we look at crime per 100K residents, crime has gone down generally, with a large increase during the pandemic.

What is especially of note is that as crime per 100k residents, the reporting of crime increased. It is difficult to comment on exactly why that is because of lack of clearance rates and the inconsistency with the “Resolution” variable in both police incident report data sets.

Another dynamic I noticed is that there was an general decrease in violent crime incidents and a general increase in Violent Crimes. My opinion is that violence is dangerous and non-violence is not dangerous. If Violent Crime is showing decrease then SF is likely becoming less dangerous.