The dataset in this subproject comes from the NYPD stop-and-frisk program: all stop-and-frisk cases in NYC for 2014.

Each observation is a ‘stop’ case. Each stop case has several attributes, such as location and identification of the parties, as well as the outcome of the stop (a frisk, a search, an arrest) and further attributes of those outcomes.

We will answer the following query: frisk “failure” rates broken down by the reason recorded for the stop.

library(tidyr)
library(dplyr)
library(stringr)
#raw<-read.csv('G:/Property/Luis_C/statsLearning/CUNY/dataClass/project2/2014_sqf_csv/2014_slim.csv',stringsAsFactors = FALSE,header=1)

raw<-read.csv('~/Documents/CUNY/data_class/project2/2014_slim_new.csv',stringsAsFactors = FALSE,header=1)

First, define a frisk failure. Two fields can serve to categorize a stop as a failure or a success: arstmade (arrest made) and sumissue (summons issued).

table(raw$arstmade)
## 
##     N     Y 
## 38889  6898
table(raw$sumissue)
## 
##     N     Y 
## 44573  1214

Without access to in-depth documentation of how these fields interact, we infer their relationship from the data itself.

Are there observations (stop cases) where both fields are true (value “Y”), that is, where both an arrest was made and a summons was issued?

te<-raw %>%
  select(key,sumissue,arstmade) %>%
  gather(type,success,c(sumissue,arstmade)) %>%
  group_by(key) %>%
  summarise(count=length(success[success=='Y']))

table(te$count)
## 
##     0     1     2 
## 37744  7974    69

This confirms that the dataset contains observations (69 of them) where the officer recorded both an arrest and a summons. This informs how we calculate the proportion of successes later: such a stop must count as one success, not two.
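
As a cross-check on the count above, the overlap between the two outcome flags can also be read directly from a two-way table. The sketch below uses a small made-up data frame named toy (an assumption for illustration, not the real data) with the same column names as raw; run against raw, the Y/Y cell would be expected to match the 69 stops found above.

```r
# Toy stand-in for raw; the values are invented for illustration only.
toy <- data.frame(
  key      = 1:6,
  arstmade = c("Y", "N", "Y", "N", "N", "Y"),
  sumissue = c("N", "N", "Y", "Y", "N", "N"),
  stringsAsFactors = FALSE
)

# Cross-tabulate the two flags: rows are arrests, columns are summonses.
both <- table(arrest = toy$arstmade, summons = toy$sumissue)
print(both)

# The Y/Y cell is the number of stops with both an arrest and a summons.
n_both <- both["Y", "Y"]
```

The two-way table also shows the arrest-only and summons-only counts separately, which the 0/1/2 tally does not.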

Before moving on, let’s run a similar script to determine whether officers always record a reason for stop alongside a reason for frisk. This will give us a clearer picture of how to measure the success or failure of a stop against the reason for it.

Columns describing the reason for frisk begin with ‘rf’; columns describing the reason for stop begin with ‘cs’. We begin an ‘either, or’ comparison of these values by looking at how often they are observed together in the data.

Let’s check the distribution of the number of reasons recorded per stop (counting both ‘cs’ and ‘rf’ columns), independent of whether a search or frisk was carried out.

r="^(cs|rf)_"
tes<-raw %>%
  select(matches(r))

tes$key<-raw$key

tes.1<-tes %>%
  gather(type,success,-key) %>%
  group_by(key) %>%
  summarise(count=length(success[success=='Y']))

table(tes.1$count)
## 
##     0     1     2     3     4     5     6     7     8     9    10    11 
## 10860  7118 11242  6854  4640  2490  1318   611   343   183    86    31 
##    12    13 
##     7     4

The distribution reveals that many stops indeed have more than one reason recorded (stop and frisk reasons combined), while 10,860 stops have none at all.
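
The same per-stop tally can be sketched without reshaping, by counting “Y” values across the reason columns of each row. This is a minimal base R illustration on made-up data: the data frame toy and the column rf_example are hypothetical, while cs_objcs and cs_descr only borrow names from the real dataset.

```r
# Toy stand-in for the cs_/rf_ columns; all values are invented.
toy <- data.frame(
  key        = 1:4,
  cs_objcs   = c("Y", "N", "N", "Y"),
  cs_descr   = c("N", "N", "Y", "Y"),
  rf_example = c("N", "N", "Y", "Y"),  # hypothetical frisk-reason column
  stringsAsFactors = FALSE
)

# Select every column whose name starts with cs_ or rf_.
reason_cols <- grep("^(cs|rf)_", names(toy), value = TRUE)

# Comparing to "Y" gives a logical matrix; rowSums counts the TRUEs per stop.
n_reasons <- rowSums(toy[reason_cols] == "Y")

# Distribution of the number of reasons per stop.
print(table(n_reasons))
```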

This motivates another look: do officers always record a reason for stop, whether or not they frisk?

We investigate this by assigning two categories: a reason for stop is provided, and a reason for frisk is provided. The two are independent and not mutually exclusive, so a single stop can fall into both.

#success == 'Y' means the reason named in 'type' was recorded; group keeps its prefix letter ('c' = stop, 'r' = frisk)
tes.3<-tes %>%
  gather(type,success,-key) %>%
  mutate(group=ifelse(success=='Y',substr(type,1,1),''))
  
#head(tes.3[tes.3$group!='',])
tes.4<-tes.3 %>%
  filter(group!='') %>%
  group_by(key,group) %>%
  summarise(groups=max(group))

#make sure this works:
tes.4 %>%
  group_by(key) %>%
  summarise(len=length(groups)) %>%
  select(len) %>%
  table()
## .
##     1     2 
## 11914 23013
#it does
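
The one-or-both tally above can be cross-checked with two logical flags per stop, one for each kind of reason. The sketch below is a base R illustration on made-up data (toy and rf_example are hypothetical names); it shows which combination of categories each stop falls into, including the case of a frisk reason with no stop reason.

```r
# Toy stand-in; values are invented. Stop 5 has a frisk reason but no stop reason.
toy <- data.frame(
  key        = 1:5,
  cs_objcs   = c("Y", "N", "N", "Y", "N"),
  cs_descr   = c("N", "N", "Y", "Y", "N"),
  rf_example = c("N", "N", "Y", "Y", "Y"),  # hypothetical frisk-reason column
  stringsAsFactors = FALSE
)

# TRUE when at least one column with the given prefix is "Y" for that stop.
has_stop  <- rowSums(toy[grep("^cs_", names(toy))] == "Y") > 0
has_frisk <- rowSums(toy[grep("^rf_", names(toy))] == "Y") > 0

# Two-way table of the flags; the FALSE/TRUE cell holds frisks without a stop code.
tab <- table(stop_reason = has_stop, frisk_reason = has_frisk)
print(tab)
```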

If they record a reason for frisk, do they always record a reason for stop?

#length(unique(tes.3[tes.3$group!='','key'])) #34,927
gy<-tes.4 %>%
  group_by(key) %>%
  mutate(len=length(groups)) %>%
  spread(groups,group) %>%
  filter(!is.na(r)) #r here is the column created by spread(), not the regex string defined earlier

table(gy$c)
## 
##     c 
## 23013

This table indicates that a “reason for stop” code is entered whenever a reason for frisk is recorded: 23,013 stops carry both, the same number found in the check above. We’ll therefore use the stop code, not the frisk code, as the reason code, with this new-found confidence that a stop code accompanies every recorded frisk reason.
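
One caveat worth checking: table() drops NA values by default, so a count of ‘c’ alone cannot reveal rows where the stop code is missing. The sketch below illustrates this on a hypothetical data frame gy_toy (invented values, shaped like gy after spread()); on the real data, the equivalent check would be comparing nrow(gy) with the table total or counting NAs directly.

```r
# Hypothetical wide table shaped like gy: one row per stop with a frisk code.
# Column c is "c" when a stop code exists and NA when it does not.
gy_toy <- data.frame(
  key = 1:4,
  c   = c("c", "c", NA, "c"),
  r   = c("r", "r", "r", "r"),
  stringsAsFactors = FALSE
)

# Default behaviour: the NA row silently disappears from the count.
default_count <- table(gy_toy$c)

# useNA = "ifany" adds an NA cell whenever missing values are present.
full_count <- table(gy_toy$c, useNA = "ifany")

# Direct count of frisk rows that lack a stop code.
missing_stop_code <- sum(is.na(gy_toy$c))

print(default_count)
print(full_count)
```

If missing_stop_code is zero and the table total equals the row count, every frisk row has a stop code.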

Now determine success: a stop counts as a success when an arrest is made or a summons is issued (or both). We recode the dataframe here.

As stated earlier, reason-for-stop codes begin with ‘cs’.

For our last calculation, we compute the proportion of successes for each ‘cs’ category.

r="^(cs_)|arstmade|sumissue|key"

#categorize a success/failure as arrest or summons issued
pre<-raw %>%
  select(matches(r)) %>%
  mutate(success=ifelse(arstmade=='Y'|sumissue=='Y','Y','N'))

blah<-pre %>%
  select(-c(arstmade,sumissue)) %>%
  gather(reason.stop,value,cs_objcs:cs_other) %>%
  filter(value=='Y') %>%
  group_by(reason.stop) %>%
  summarise(prop=length(key[success=='Y'])/length(key))

blah
## # A tibble: 10 x 2
##    reason.stop       prop
##          <chr>      <dbl>
## 1     cs_bulge 0.19864048
## 2     cs_casng 0.09927984
## 3     cs_cloth 0.12627986
## 4     cs_descr 0.15813191
## 5     cs_drgtr 0.27561705
## 6     cs_furtv 0.17171121
## 7     cs_lkout 0.10944882
## 8     cs_objcs 0.34071550
## 9     cs_other 0.23558067
## 10    cs_vcrim 0.13347802
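
The table above reports success proportions, so the failure rate for each reason is one minus the listed value. The sketch below restates the calculation in base R on made-up data (toy and its values are assumptions for illustration), using the fact that the mean of a logical vector is the proportion of TRUE values.

```r
# Toy stand-in for raw; values are invented for illustration only.
toy <- data.frame(
  key      = 1:5,
  arstmade = c("Y", "N", "N", "N", "Y"),
  sumissue = c("N", "Y", "N", "N", "Y"),
  cs_objcs = c("Y", "Y", "N", "N", "Y"),
  cs_other = c("N", "Y", "Y", "Y", "N"),
  stringsAsFactors = FALSE
)

# A stop is a success when an arrest or a summons (or both) was recorded.
success <- toy$arstmade == "Y" | toy$sumissue == "Y"

# For each cs_ column, average the success flag over stops that cite that reason.
cs_cols <- grep("^cs_", names(toy), value = TRUE)
prop_success <- sapply(cs_cols, function(col) mean(success[toy[[col]] == "Y"]))

# Failure rate is the complement of the success proportion.
prop_failure <- 1 - prop_success

print(round(prop_success, 3))
print(round(prop_failure, 3))
```

A stop that cites several reasons contributes to each of them, which mirrors how the gather() step above repeats a stop once per recorded reason.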

Conclusion

Type I errors (stops that led to neither an arrest nor a summons) are most frequent for ‘cs_casng’: REASON FOR STOP - CASING A VICTIM OR LOCATION and ‘cs_lkout’: REASON FOR STOP - SUSPECT ACTING AS A LOOKOUT, whose success proportions are only about 9.9% and 10.9%. The city should weigh the legal ramifications of type I errors in these kinds of stops. It may not be prudent to stop individuals for these reasons alone.