===================== # STEP 1: Set up my environment # =====================

Notes: Set up the R environment by loading the tidyverse (plus lubridate, janitor, and scales) and reading in the Stockton data set.

library(tidyverse)
library(lubridate)
library(janitor)
library(scales)
stockton <- read.csv("ca_stockton_2020_04_01.csv")

===================== # STEP 2: Explore the data set # =====================

First, let’s take a look at what columns exist in our stockton data frame.

colnames(stockton)
##  [1] "raw_row_number"   "date"             "division"         "subject_age"     
##  [5] "subject_race"     "subject_sex"      "officer_id_hash"  "type"            
##  [9] "arrest_made"      "citation_issued"  "warning_issued"   "outcome"         
## [13] "search_conducted" "search_basis"     "reason_for_stop"  "raw_result"      
## [17] "raw_search"

How many stops do we have in our dataset?

nrow(stockton)
## [1] 41629

What date range does our data cover?

## [1] "2012-01-01"

to

## [1] "2016-12-31"
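The code that produced the two dates above is not shown. A minimal sketch of one way to get a date range, assuming the date column holds ISO-formatted strings (as read.csv() would typically return them). The toy_stops data frame is a hypothetical stand-in for stockton.

```r
# Hypothetical stand-in for the `stockton` data frame; only `date` matters here.
toy_stops <- data.frame(
  date = c("2012-01-01", "2014-06-15", "2016-12-31"),
  stringsAsFactors = FALSE
)

# Parse the ISO-formatted strings into Date objects, then take the
# earliest and latest values in one call.
date_range <- range(as.Date(toy_stops$date))
date_range
```

With the real data, the same call would be range(as.Date(stockton$date)).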

===================== # STEP 3: Clean the data # =====================

Our analysis is only relevant for traffic stops, so let’s make sure the data frame contains only vehicular stops.

unique(stockton$type)
## [1] "vehicular"

===================== # STEP 4: Conduct descriptive analysis # =====================

To find stop counts per year, we need to define a notion of year (recall that our data only has date). Use the year() and mutate() functions to add a new column called year to our stops data frame and then use count() to determine the number of stops per year.

##   year     n
## 1 2012  2863
## 2 2013  7874
## 3 2014  7023
## 4 2015 14310
## 5 2016  9559
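The code behind the table above is not shown. A minimal sketch of the mutate() and count() steps described, run on a hypothetical toy_stops data frame rather than the real stockton data. Parsing with ymd() first is an assumption made here so that year() receives a proper Date.

```r
library(dplyr)
library(lubridate)

# Hypothetical stand-in for `stockton`; only the `date` column is needed.
toy_stops <- tibble(date = c("2012-03-01", "2012-07-09", "2013-01-20"))

# Derive a `year` column from the date, then count the rows in each year.
stops_per_year <- toy_stops %>%
  mutate(year = year(ymd(date))) %>%
  count(year)

stops_per_year
```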

Use count() to determine the number of stops by race.

##             subject_race     n
## 1 asian/pacific islander  3787
## 2                  black 10870
## 3               hispanic 16573
## 4                  other  1640
## 5                  white  8553

Let’s make another table that gives us the proportion of stops by race.

##             subject_race     n       prop
## 1 asian/pacific islander  3787 0.09142264
## 2                  black 10870 0.26241460
## 3               hispanic 16573 0.40009174
## 4                  other  1640 0.03959153
## 5                  white  8553 0.20647949
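The counting and proportion steps can be sketched as follows on a hypothetical toy_stops data frame. The tables above contain no NA row, so missing races were presumably dropped first. That filtering step is an assumption here, not something the original code confirms.

```r
library(dplyr)

# Hypothetical stand-in for `stockton`; one row has a missing race.
toy_stops <- tibble(
  subject_race = c("black", "white", "hispanic", "hispanic", NA)
)

race_props <- toy_stops %>%
  filter(!is.na(subject_race)) %>%  # assumed step: drop stops with unknown race
  count(subject_race) %>%           # stops per race
  mutate(prop = n / sum(n))         # each race's share of all counted stops

race_props
```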

This stat on its own, though, doesn’t actually say much. We’ll return to this more rigorously later on.

How many stops were there by year and race?

##    year           subject_race    n
## 1  2012 asian/pacific islander  268
## 2  2012                  black  624
## 3  2012               hispanic 1114
## 4  2012                  other  156
## 5  2012                  white  690
## 6  2013 asian/pacific islander  788
## 7  2013                  black 1841
## 8  2013               hispanic 3193
## 9  2013                  other  321
## 10 2013                  white 1710
## 11 2014 asian/pacific islander  657
## 12 2014                  black 1815
## 13 2014               hispanic 2691
## 14 2014                  other  284
## 15 2014                  white 1449
## 16 2015 asian/pacific islander 1271
## 17 2015                  black 3889
## 18 2015               hispanic 5787
## 19 2015                  other  507
## 20 2015                  white 2827
## 21 2016 asian/pacific islander  803
## 22 2016                  black 2701
## 23 2016               hispanic 3788
## 24 2016                  other  372
## 25 2016                  white 1877

Let’s visualize the stop counts by year and race.
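The original plot is not reproduced here. One way to draw it is a line chart of yearly counts with one line per race. The sketch below uses only the black and white rows from the year-by-race table above. The chart type and labels are assumptions, not necessarily what the original figure used.

```r
library(dplyr)
library(ggplot2)

# Yearly stop counts copied from the year-by-race table (two groups only).
counts <- tibble(
  year = rep(2012:2016, times = 2),
  subject_race = rep(c("black", "white"), each = 5),
  n = c(624, 1841, 1815, 3889, 2701,
        690, 1710, 1449, 2827, 1877)
)

# One line per race, with points marking each year's count.
stops_plot <- ggplot(counts, aes(x = year, y = n, color = subject_race)) +
  geom_line() +
  geom_point() +
  labs(x = "Year", y = "Number of stops", color = "Race")

stops_plot
```

With the full data, counts would come from the year-by-race count() instead of being typed in.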

All groups experienced a spike in 2015. Let’s look further and see what happened in that year by filtering the data down to 2015.

df2015 <- stockton %>% filter(year(date) == 2015)

Look up the racial demographics of Stockton in 2015 (i.e. the number of Hispanic, Black, Asian, etc. residents) and create a data frame to use as a comparison baseline.

population_2015 <- tibble(subject_race = c("asian/pacific islander","black", "hispanic", "other","white"),
  num_people = c(4272, 34772, 126048, 39432, 132471)) %>%
  mutate(subject_race = as.factor(subject_race))

Let’s view the population counts alongside each group’s proportion of the total.

## # A tibble: 5 x 3
##   subject_race           num_people proportion
##   <fct>                       <dbl>      <dbl>
## 1 asian/pacific islander       4272     0.0127
## 2 black                       34772     0.103 
## 3 hispanic                   126048     0.374 
## 4 other                       39432     0.117 
## 5 white                      132471     0.393
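The proportion column above is each group’s count divided by the total. A minimal sketch using the population figures given in this document. The name pop_props is introduced here for illustration only.

```r
library(dplyr)

# Population figures as given for population_2015 in this document.
population_2015 <- tibble(
  subject_race = c("asian/pacific islander", "black", "hispanic", "other", "white"),
  num_people = c(4272, 34772, 126048, 39432, 132471)
)

# Each group's share of the total population.
pop_props <- population_2015 %>%
  mutate(proportion = num_people / sum(num_people))

pop_props
```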

Visualization

Stop rates
----------

If we join the two tables (population_2015 and the 2015 stop counts from df2015), we can compute stop rates by race, i.e. the number of stops per person. (Note: remember to account for how many years your stop data covers in order to get a true per-capita value; here both the stops and the population are for 2015 only, so we’re in good shape.) Use mutate() to calculate the stop rate.

df2015 %>% 
  count(subject_race) %>% 
  drop_na() %>% 
  left_join(population_2015,by = "subject_race") %>%
  mutate(stop_rate = n / num_people)
##             subject_race    n num_people  stop_rate
## 1 asian/pacific islander 1271       4272 0.29751873
## 2                  black 3889      34772 0.11184286
## 3               hispanic 5787     126048 0.04591108
## 4                  other  507      39432 0.01285758
## 5                  white 2827     132471 0.02134052
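To compare groups directly, each stop rate can be divided by the white stop rate. A minimal sketch that re-enters the counts from the table above. The ratio_to_white column name is introduced here for illustration.

```r
library(dplyr)

# Stop counts and population figures copied from the table above.
stop_rates <- tibble(
  subject_race = c("asian/pacific islander", "black", "hispanic", "other", "white"),
  n = c(1271, 3889, 5787, 507, 2827),
  num_people = c(4272, 34772, 126048, 39432, 132471)
) %>%
  mutate(
    stop_rate = n / num_people,
    # Ratio of each group's stop rate to the white stop rate.
    ratio_to_white = stop_rate / stop_rate[subject_race == "white"]
  )

stop_rates
```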

Even though Asian/Pacific Islanders made up only about 1.3 percent of Stockton’s population in 2015, they had the highest stop rate, at 29.8 percent. Hispanic people had a 4.6 percent stop rate, about 2.2 times the white rate, and Black people had an 11.2 percent stop rate, about 5.2 times the white rate. Is there a difference by sex?

##   subject_sex    n      prop
## 1      female 4406 0.3083922
## 2        male 9881 0.6916078

Unfortunately, we don’t have a population data frame to use as a baseline for stop rates by sex, but we can visualize the number of arrests by race and sex.

Males appear to outnumber females, but without a baseline comparison we cannot determine the difference in stop rates between males and females. Moreover, many of these interactions did not result in an arrest. What were the outcomes?


Search rates
------------

Let’s do the same sort of benchmark comparison for search and arrest rates. These are easier than stop rates since we don’t need an external population benchmark: we can use the stopped population as our baseline, defining the search rate as the proportion of stopped people who were subsequently searched, and the arrest rate as the proportion of stopped people who were arrested. Let’s get these values by race.

## # A tibble: 5 x 3
##   subject_race           search_rate arrest_rate
##   <chr>                        <dbl>       <dbl>
## 1 asian/pacific islander      0.124       0.0181
## 2 black                       0.252       0.0180
## 3 hispanic                    0.186       0.0131
## 4 other                       0.0750      0.0118
## 5 white                       0.113       0.0141
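The code behind the table above is not shown. A minimal sketch of one way to compute per-race rates, assuming search_conducted and arrest_made are logical (TRUE/FALSE) columns. The toy_stops data frame is hypothetical, and na.rm = TRUE is an assumption about how missing values should be handled.

```r
library(dplyr)

# Hypothetical stand-in for the 2015 stops; columns assumed to be logical.
toy_stops <- tibble(
  subject_race = c("black", "black", "white", "white"),
  search_conducted = c(TRUE, FALSE, FALSE, FALSE),
  arrest_made = c(FALSE, FALSE, TRUE, FALSE)
)

# The mean of a logical vector is the proportion of TRUE values,
# so it gives the rate directly within each group.
rates_by_race <- toy_stops %>%
  group_by(subject_race) %>%
  summarize(
    search_rate = mean(search_conducted, na.rm = TRUE),
    arrest_rate = mean(arrest_made, na.rm = TRUE)
  )

rates_by_race
```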

Now let’s dive into these results. As with the stop rates, we can make a quantitative claim about disparities in search and arrest rates by dividing the minority rate by the white rate.

Here we see that among drivers who were stopped, black drivers were searched at about 2.2 times the rate of white drivers, and Hispanic drivers were searched at about 1.6 times the rate of white drivers.

Black drivers were arrested at about 1.3 times the rate of white drivers, while Hispanic drivers were arrested at about 0.9 times the white rate, i.e. slightly less often.
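The ratios quoted above follow from dividing each group’s rate by the white rate. A minimal sketch using the rounded search_rate and arrest_rate values from the table above. The ratio column names are introduced here for illustration.

```r
library(dplyr)

# Rounded rates copied from the search/arrest table above.
disparity <- tibble(
  subject_race = c("asian/pacific islander", "black", "hispanic", "other", "white"),
  search_rate = c(0.124, 0.252, 0.186, 0.0750, 0.113),
  arrest_rate = c(0.0181, 0.0180, 0.0131, 0.0118, 0.0141)
) %>%
  mutate(
    # A value above 1 means a higher rate than white drivers, below 1 a lower rate.
    search_ratio = search_rate / search_rate[subject_race == "white"],
    arrest_ratio = arrest_rate / arrest_rate[subject_race == "white"]
  )

disparity
```

Because the inputs are already rounded, the resulting ratios are approximate.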

This is all to say that while benchmark stats are a good place to start, more investigation is required before we can draw any conclusions that bias or discrimination is overtly present in Stockton law enforcement.