===================== # STEP 1: Set up my environment # =====================
Notes: set up the R environment by loading ‘tidyverse’ and the supporting packages, then read in the Stockton data set.
library(tidyverse)
library(lubridate)
library(janitor)
library(scales)
stockton <- read.csv("ca_stockton_2020_04_01.csv")
===================== # STEP 2: Explore the data set # =====================
First, let’s take a look at what columns exist in our stockton data frame.
colnames(stockton)
## [1] "raw_row_number" "date" "division" "subject_age"
## [5] "subject_race" "subject_sex" "officer_id_hash" "type"
## [9] "arrest_made" "citation_issued" "warning_issued" "outcome"
## [13] "search_conducted" "search_basis" "reason_for_stop" "raw_result"
## [17] "raw_search"
How many stops do we have in our dataset?
nrow(stockton)
## [1] 41629
What date range does our data cover?
## [1] "2012-01-01"
to
## [1] "2016-12-31"
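The code behind the two dates above is not shown in the output. A minimal sketch of one way to get them, assuming `date` is stored as `YYYY-MM-DD` text; the three dates below are toy stand-ins, not rows from the file:

```r
# Toy stand-in for stockton$date (assumption: ISO-formatted text).
dates <- c("2014-06-15", "2012-01-01", "2016-12-31")

# Convert to Date, then take the earliest and latest values.
date_range <- range(as.Date(dates), na.rm = TRUE)
date_range
```

On the real data the same call would be `range(as.Date(stockton$date), na.rm = TRUE)`.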
===================== # STEP 3: Clean the data # =====================
Our analysis is only relevant to traffic stops, so let’s make sure the data frame contains only that type of stop.
unique(stockton$type)
## [1] "vehicular"
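Here every stop is already vehicular, so no rows need to be removed. If other stop types were present, a filter along these lines would keep only traffic stops. This is a sketch on toy data; the `"pedestrian"` value and the `toy_stops` name are assumptions for illustration:

```r
library(dplyr)
library(tibble)

# Toy stand-in for the stops data; "pedestrian" is a hypothetical second
# stop type, included only so the filter has something to remove.
toy_stops <- tibble(
  raw_row_number = 1:4,
  type = c("vehicular", "pedestrian", "vehicular", NA)
)

# Keep only vehicular stops; filter() also drops rows where type is NA.
traffic_stops <- toy_stops %>%
  filter(type == "vehicular")

unique(traffic_stops$type)
```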
===================== # STEP 4: Conduct descriptive analysis # =====================
To find stop counts per year, we need to define a notion of year (recall that our data only has date). Use the year() and mutate() functions to add a new column called year to our stops data frame and then use count() to determine the number of stops per year.
## year n
## 1 2012 2863
## 2 2013 7874
## 3 2014 7023
## 4 2015 14310
## 5 2016 9559
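The year() / mutate() / count() steps described above can be sketched as follows. The data frame here is a small toy example (not rows from the file), and it assumes `date` is `YYYY-MM-DD` text:

```r
library(dplyr)
library(tibble)
library(lubridate)

# Toy stand-in for the stops data.
toy_stops <- tibble(
  date = c("2012-03-01", "2012-07-09", "2013-01-20",
           "2015-05-05", "2015-11-30", "2015-12-01")
)

stops_by_year <- toy_stops %>%
  mutate(year = year(ymd(date))) %>%  # parse the text date, extract the year
  count(year)                         # one row per year, n = number of stops

stops_by_year
```

Applied to the real data, the same pipeline would start from `stockton` instead of `toy_stops`.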
Use count() to determine the number of stops by race.
## subject_race n
## 1 asian/pacific islander 3787
## 2 black 10870
## 3 hispanic 16573
## 4 other 1640
## 5 white 8553
Let’s make another table that gives us the proportion of stops by race.
## subject_race n prop
## 1 asian/pacific islander 3787 0.09142264
## 2 black 10870 0.26241460
## 3 hispanic 16573 0.40009174
## 4 other 1640 0.03959153
## 5 white 8553 0.20647949
This stat on its own, though, doesn’t actually say much. We’ll return to this more rigorously later on.
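The proportion column is simply each group’s count divided by the total. A minimal sketch that rebuilds it from the counts printed above (on the real data the first step would be `stockton %>% count(subject_race)`, presumably after dropping rows with a missing race, since the counts sum to 41,423 rather than 41,629):

```r
library(dplyr)
library(tibble)

# Counts copied from the table above.
race_counts <- tibble(
  subject_race = c("asian/pacific islander", "black", "hispanic",
                   "other", "white"),
  n = c(3787, 10870, 16573, 1640, 8553)
)

# Share of all stops accounted for by each group.
race_props <- race_counts %>%
  mutate(prop = n / sum(n))

race_props
```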
How about counting stops by year and race?
## year subject_race n
## 1 2012 asian/pacific islander 268
## 2 2012 black 624
## 3 2012 hispanic 1114
## 4 2012 other 156
## 5 2012 white 690
## 6 2013 asian/pacific islander 788
## 7 2013 black 1841
## 8 2013 hispanic 3193
## 9 2013 other 321
## 10 2013 white 1710
## 11 2014 asian/pacific islander 657
## 12 2014 black 1815
## 13 2014 hispanic 2691
## 14 2014 other 284
## 15 2014 white 1449
## 16 2015 asian/pacific islander 1271
## 17 2015 black 3889
## 18 2015 hispanic 5787
## 19 2015 other 507
## 20 2015 white 2827
## 21 2016 asian/pacific islander 803
## 22 2016 black 2701
## 23 2016 hispanic 3788
## 24 2016 other 372
## 25 2016 white 1877
Let’s visualize.
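The plot itself is not reproduced in this output. A minimal sketch of one way to draw it, using the year-by-race counts from the table above; the choice of a line chart and the axis labels are assumptions:

```r
library(ggplot2)
library(tibble)

races <- c("asian/pacific islander", "black", "hispanic", "other", "white")

# Counts copied from the year-by-race table, one block of five per year.
stops_year_race <- tibble(
  year = rep(2012:2016, each = 5),
  subject_race = rep(races, times = 5),
  n = c(268, 624, 1114, 156, 690,
        788, 1841, 3193, 321, 1710,
        657, 1815, 2691, 284, 1449,
        1271, 3889, 5787, 507, 2827,
        803, 2701, 3788, 372, 1877)
)

# One line per race, stops on the y-axis, year on the x-axis.
p <- ggplot(stops_year_race, aes(x = year, y = n, color = subject_race)) +
  geom_line() +
  geom_point() +
  labs(x = "Year", y = "Number of stops", color = "Race")

p
```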
All groups experienced a spike in 2015. Let’s look further at what happened that year by filtering the data down to 2015 stops.
df2015 <- stockton %>% filter(year(date) == 2015)
Next, look up Stockton’s racial demographics for 2015 (i.e. the number of Hispanic, Black, Asian, etc. residents) and create a data frame to serve as a baseline for comparison.
population_2015 <- tibble(
  subject_race = c("asian/pacific islander", "black", "hispanic", "other", "white"),
  num_people = c(4272, 34772, 126048, 39432, 132471)
) %>%
  mutate(subject_race = as.factor(subject_race))
Let’s look at these population counts alongside each group’s share of the total.
## # A tibble: 5 x 3
## subject_race num_people proportion
## <fct> <dbl> <dbl>
## 1 asian/pacific islander 4272 0.0127
## 2 black 34772 0.103
## 3 hispanic 126048 0.374
## 4 other 39432 0.117
## 5 white 132471 0.393
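The proportion column above is each group’s population divided by the total population. A minimal sketch of how it can be added (the code that produced the printed table is not shown, so this is one plausible version):

```r
library(dplyr)
library(tibble)

# Same population figures as defined earlier in this step.
population_2015 <- tibble(
  subject_race = c("asian/pacific islander", "black", "hispanic",
                   "other", "white"),
  num_people = c(4272, 34772, 126048, 39432, 132471)
)

# Share of the total population for each group.
population_props <- population_2015 %>%
  mutate(proportion = num_people / sum(num_people))

population_props
```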
Visualization
If we join the two tables (population_2015 and df2015), we can compute stop rates by race, i.e. the number of stops per person. (Note: remember to account for how many years your stop data covers in order to get a true per-capita value; here both the stops and the population are for 2015 only, so we’re in good shape.) Use mutate() to calculate the stop rate.
df2015 %>%
  count(subject_race) %>%
  drop_na() %>%
  left_join(population_2015, by = "subject_race") %>%
  mutate(stop_rate = n / num_people)
## subject_race n num_people stop_rate
## 1 asian/pacific islander 1271 4272 0.29751873
## 2 black 3889 34772 0.11184286
## 3 hispanic 5787 126048 0.04591108
## 4 other 507 39432 0.01285758
## 5 white 2827 132471 0.02134052
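To compare groups, we can divide each stop rate by the white stop rate. A minimal sketch that rebuilds the table from the printed counts; the column name `ratio_to_white` is my own label, not something from the original analysis:

```r
library(dplyr)
library(tibble)

# Stops and population per group, copied from the table above.
stop_rates <- tibble(
  subject_race = c("asian/pacific islander", "black", "hispanic",
                   "other", "white"),
  n = c(1271, 3889, 5787, 507, 2827),
  num_people = c(4272, 34772, 126048, 39432, 132471)
) %>%
  mutate(
    stop_rate = n / num_people,
    # How many times the white stop rate each group's rate is.
    ratio_to_white = stop_rate / stop_rate[subject_race == "white"]
  )

stop_rates
```

A ratio above 1 means the group was stopped more often per resident than white residents were; a ratio below 1 means less often.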
Even though Asian/Pacific Islanders made up only 1.3 percent of Stockton’s population in 2015, they had the highest stop rate, at 29.8 percent. Hispanics had a 4.6 percent stop rate, 2.2 times the white rate. Blacks had an 11.2 percent stop rate, 5.2 times the white rate. Is there a difference by sex?
## subject_sex n prop
## 1 female 4406 0.3083922
## 2 male 9881 0.6916078
Unfortunately, we don’t have population data by sex to use as a baseline for stop rates, but we can visualize the number of arrests by race and sex.
Males appear to outnumber females, but without a baseline comparison we cannot determine the difference in stop rates between males and females. However, many of these interactions did not result in an arrest. What were the outcomes?
Let’s do the same sort of benchmark comparison for search and arrest rates. These are easier than the last one since we don’t need an external population benchmark. We can use the stopped population as our baseline, defining the search rate as the proportion of stopped people who were subsequently searched, and the arrest rate as the proportion of stopped people who were subsequently arrested. Let’s get these values by race.
## # A tibble: 5 x 3
## subject_race search_rate arrest_rate
## <chr> <dbl> <dbl>
## 1 asian/pacific islander 0.124 0.0181
## 2 black 0.252 0.0180
## 3 hispanic 0.186 0.0131
## 4 other 0.0750 0.0118
## 5 white 0.113 0.0141
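The code that produced this table is not shown. A minimal sketch of the usual pattern, on toy data, assuming `search_conducted` and `arrest_made` are logical (TRUE/FALSE) columns so that their mean is the share of stops where the event happened:

```r
library(dplyr)
library(tibble)

# Toy stand-in for the stops data: four stops per group.
toy_stops <- tibble(
  subject_race     = c(rep("black", 4), rep("white", 4)),
  search_conducted = c(TRUE, TRUE, FALSE, FALSE,  TRUE, FALSE, FALSE, FALSE),
  arrest_made      = c(TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE)
)

# The mean of a logical column is the proportion of TRUE values.
rates <- toy_stops %>%
  group_by(subject_race) %>%
  summarize(
    search_rate = mean(search_conducted, na.rm = TRUE),
    arrest_rate = mean(arrest_made, na.rm = TRUE)
  )

rates
```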
Now let’s dive into these results. As with the stop rates, we can make a quantitative claim about disparities in search and arrest rates by dividing the minority rate by the white rate.
Here we see that among drivers who were stopped, black drivers were searched at 2.2 times the rate of white drivers, and Hispanic drivers were searched at 1.6 times the rate of white drivers.
Black drivers were arrested at 1.3 times the rate of white drivers, while Hispanic drivers were arrested at 0.9 times the rate of white drivers, i.e. slightly less often.
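These ratios can be checked directly from the rates table. A minimal sketch using the rounded values printed above (so the ratios are approximate); the `*_ratio` column names are my own labels:

```r
library(dplyr)
library(tibble)

# Rates copied from the printed table (already rounded).
rates <- tibble(
  subject_race = c("asian/pacific islander", "black", "hispanic",
                   "other", "white"),
  search_rate = c(0.124, 0.252, 0.186, 0.0750, 0.113),
  arrest_rate = c(0.0181, 0.0180, 0.0131, 0.0118, 0.0141)
)

# Pull out the white row to use as the baseline.
white <- filter(rates, subject_race == "white")

disparities <- rates %>%
  mutate(
    search_ratio = search_rate / white$search_rate,
    arrest_ratio = arrest_rate / white$arrest_rate
  )

disparities
```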
This is all to say that while benchmark stats are a good place to start, more investigation is required before we can draw any conclusions about whether bias or discrimination is present in Stockton law enforcement.