Competing Approaches for Predicting Stop Duration By Police


Research Question:


I will use same dataset that I used homework 4 that is stop and frisk dataset. However, this week instead of likelihood of someone being arrested based on different variables, I am interested to see that is there any difference between genders in terms of stop duration (in minute). In this dataset, I have information on two different conceptual levels. Street/Neighborhood-level & suspect-level. Individual are nested within Neighborhood. Neighborhood provide a context for suspect, since different neighborhoods has different economical and demographical characteristics. Therefore, I am expected to see difference in stop duration in different street.


Reminder of about The Data

The dataset I will use is stop and frisk data from \(2018\). Every time a police officer stops a person in \(NYC\), the officer is require to fill out a form recording the details of the stop. The forms were filled out by hand and manually entered into an \(NYPD\) database. When the forms became electronic, database are available for download for public . Each row of the data set represent a stop by \(NYPD\) officer. Dataset also contain variable such as the age, race, weight, height of the person stopped, if a person was frisked (when an officer pass the hands over someone in a search for hidden weapons, drugs, or other items and the location of the stop.

Data Import and Preparation

Importing Library:

library(nlme)
library(sjmisc)
library(dplyr)
library(magrittr)
library(tidyr)
library(haven)
library(lmerTest)
library(ggplot2)
library(texreg)
library(DT)
library(readxl)

Data Import:

# Reading the excel data
sqf<- read_excel("C:/Users/Afzal Hossain/Downloads/sqf-2018.xlsx")

#Keeping the variable that i am interested in.
sq<-select(sqf, 'SUSPECT_SEX',"STOP_DURATION_MINUTES","STOP_LOCATION_STREET_NAME","SUSPECT_REPORTED_AGE","STOP_LOCATION_BORO_NAME")

# Lets look the Data
dim(sq)
## [1] 11008     5

Removing NA:

# making the the variable character temporarily
sq$SUSPECT_SEX <- as.character(sq$SUSPECT_SEX)
sq$SUSPECT_SEX[sq$SUSPECT_SEX == "(null)"] <- NA

sq$SUSPECT_REPORTED_AGE <- as.character(sq$SUSPECT_REPORTED_AGE)
sq$SUSPECT_REPORTED_AGE[sq$SUSPECT_REPORTED_AGE == "(null)"] <- NA


Changing Data Type and Recoding:

sq$SUSPECT_SEX<- as.factor(sq$SUSPECT_SEX)
sq$STOP_DURATION_MINUTES<-as.double(sq$STOP_DURATION_MINUTES)
sq$STOP_LOCATION_STREET_NAME<-as.factor(sq$STOP_LOCATION_STREET_NAME)
sq$SUSPECT_REPORTED_AGE<-as.numeric(sq$SUSPECT_REPORTED_AGE)
sq$STOP_LOCATION_BORO_NAME<-as.factor(sq$STOP_LOCATION_BORO_NAME)

## Creating New Variables
sq <- sq %>% 
  mutate(SUS_SEX = as.integer(SUSPECT_SEX))


Recoding Male=0 and Female=1 and filtering NA:

# Recoding and filtering
sq <- sq %>% 
        filter(!is.na(SUS_SEX),!is.na(STOP_LOCATION_STREET_NAME),!is.na(SUSPECT_REPORTED_AGE),
              !is.na(STOP_DURATION_MINUTES))%>%  mutate(sex = sjmisc::rec(SUS_SEX, rec = "2=0; 1=1")) %>%
              select(STOP_DURATION_MINUTES,sex,everything())
datatable(head(sq))


Considering street, that has more than 10 stops:

sq<-sq[sq$STOP_LOCATION_STREET_NAME %in% names(which(table(sq$STOP_LOCATION_STREET_NAME) > 9)), ]
dim(sq)
## [1] 5528    7

By doing so, I lost half of my data (from 11008 to 5528) which is not ideal, but it is for learning purpose and in real research, I would have combine multiple year data that will element this problem of having a street with only 1 or 2 stops. Technically, I can do regression with more than one stop, but that will not be logical.


Streets/Neighborhoos


How many Streets:

# Unique streets
length(unique(sq$STOP_LOCATION_STREET_NAME))
## [1] 265

I have 265 street/neighborhood, which is jut enought for multilevel analysis.


Finding number of stop in each street:

# Group bt street
num_stop<-sq %>% 
  group_by(STOP_LOCATION_STREET_NAME) %>% 
  summarise(n_stop = n())

datatable(num_stop)

Individual Level Analysis


Ignoring Different Streets


Complete Pooling

Complete Pooling Model:


cp <- lm(STOP_DURATION_MINUTES~sex , data = sq)
summary(cp)
## 
## Call:
## lm(formula = STOP_DURATION_MINUTES ~ sex, data = sq)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -14.09  -5.97  -3.97   2.03 407.03 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  10.9698     0.2198  49.912  < 2e-16 ***
## sex           3.1168     0.7091   4.395 1.13e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.54 on 5526 degrees of freedom
## Multiple R-squared:  0.003484,   Adjusted R-squared:  0.003303 
## F-statistic: 19.32 on 1 and 5526 DF,  p-value: 1.127e-05

Looking at complete pooling model, where I looked at the sex without street, I can say that male stop duration on average is 11 minute where for female is 3.1 minute more than that. Now it clearer that female’s stop duration is longer than male. May be NYPD try to treat female more gently and take things easy with female than male.


No-Pooling Model

Since, a lot can change in different street/neighborhood, I will run model for each of 265 neighborhood/street.


No-Pooling Model:


# Building model for every street
dcoef <- sq %>% 
    group_by(STOP_LOCATION_STREET_NAME) %>% do(mod = lm(STOP_DURATION_MINUTES~sex, data = .))


No-Pooling Model-The Intercept:


coef <- dcoef %>% do(data.frame(intc = coef(.$mod)[1]))
ggplot(coef, aes(x = intc)) + geom_histogram()+xlab("Police Stop Duration for Male for 264 Street")+
  theme(panel.background = element_rect(fill = "lightblue1", colour = "lightblue1"))


Looking at the distribution interaction of my model for each neighborhood, I can say that average stop duration for man is around 11, but it can last few minute to over 40 minute based on different street neighborhood.


No-Pooling model-The slope:


coef <- dcoef %>% do(data.frame(intc = coef(.$mod)[2]))
ggplot(coef, aes(x = intc)) + geom_histogram()+xlab("Gender Difference for Police Stop Duration in 264 Different Street")+ theme(panel.background = element_rect(fill = "lightblue1", colour = "lightblue1"))


This distribution show that the difference of average stop length in terms of gender. I can see that most neighborhood, female have on average around 5 minute longer stop time than male, but at the same time some of the street, female stop time average 30 minute less than male to 50 minute more than male.

Partial-Pooling & Ecological Analysis


Street/Neighborhood Level Analysis


Ecological Analysis

#Group by Neighborhood
stop_street <- sq %>% 
  group_by(STOP_LOCATION_STREET_NAME) %>% 
  summarise(mean_stop_time = mean(STOP_DURATION_MINUTES), mean_s = mean(sex))

# Ecological model
ecoreg <- lm(mean_stop_time ~ mean_s, data = stop_street)
summary(ecoreg)
## 
## Call:
## lm(formula = mean_stop_time ~ mean_s, data = stop_street)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.134 -3.073 -1.158  1.927 30.585 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  10.8734     0.4377  24.840   <2e-16 ***
## mean_s        3.1292     3.5296   0.887    0.376    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.97 on 263 degrees of freedom
## Multiple R-squared:  0.00298,    Adjusted R-squared:  -0.0008113 
## F-statistic: 0.786 on 1 and 263 DF,  p-value: 0.3761

Looking at this analysis, I can say one unit increase in the mean proportion of female will increase street mean of stop time up by 3.13. This is counterintuitive that increase in female will increase in stop time.

Random Intercept

This is random intercept model where it will not allow the gender difference in stop duration to differ between streets/neighborhoods.


r1_lme <- lme(STOP_DURATION_MINUTES~sex, data = sq, random = ~1|STOP_LOCATION_STREET_NAME,method = "ML")
summary(r1_lme)
## Linear mixed-effects model fit by maximum likelihood
##  Data: sq 
##        AIC      BIC    logLik
##   45954.54 45981.01 -22973.27
## 
## Random effects:
##  Formula: ~1 | STOP_LOCATION_STREET_NAME
##         (Intercept) Residual
## StdDev:    3.086912 15.22948
## 
## Fixed effects: STOP_DURATION_MINUTES ~ sex 
##                 Value Std.Error   DF  t-value p-value
## (Intercept) 10.915870 0.3013626 5262 36.22172   0e+00
## sex          2.795946 0.7077600 5262  3.95042   1e-04
##  Correlation: 
##     (Intr)
## sex -0.216
## 
## Standardized Within-Group Residuals:
##        Min         Q1        Med         Q3        Max 
## -1.4685819 -0.4264595 -0.2418690  0.1170009 25.9647859 
## 
## Number of Observations: 5528
## Number of Groups: 265

Holding Gender constant and allowing between-street variations, I can say that standard deviation for male is 3.1. I can see that slop (gender deference’s of stop duration) decreases a little, this because of different street.

Random Slope Model


ctrl <- lmeControl(opt='optim')

rs_lme <-lme(STOP_DURATION_MINUTES~sex, data = sq, random = ~sex|STOP_LOCATION_STREET_NAME,control=ctrl,method = "ML")
summary(rs_lme)
## Linear mixed-effects model fit by maximum likelihood
##  Data: sq 
##        AIC      BIC    logLik
##   45954.06 45993.77 -22971.03
## 
## Random effects:
##  Formula: ~sex | STOP_LOCATION_STREET_NAME
##  Structure: General positive-definite, Log-Cholesky parametrization
##             StdDev    Corr  
## (Intercept)  3.094139 (Intr)
## sex          2.913017 -0.127
## Residual    15.202778       
## 
## Fixed effects: STOP_DURATION_MINUTES ~ sex 
##                 Value Std.Error   DF  t-value p-value
## (Intercept) 10.938694 0.3018118 5262 36.24342  0.0000
## sex          2.496181 0.7638182 5262  3.26803  0.0011
##  Correlation: 
##     (Intr)
## sex -0.231
## 
## Standardized Within-Group Residuals:
##        Min         Q1        Med         Q3        Max 
## -1.4317774 -0.4233091 -0.2432522  0.1168222 25.9974206 
## 
## Number of Observations: 5528
## Number of Groups: 265

Looking at this analysis, I see that there is alteration in slop (gender differences) in different street is 2.4. Allowing the gender difference in stop duration to differ between street/neighborhood, I can say that, we decries the stop compared to the pooling model,but also random intercept model. Therefore, we can say that this prove my hypothesis that there is both street/neighborhoods and gender variation in police stop time.


Best Model


datatable(AIC(cp, r1_lme,rs_lme))

The AIC is lowest for rs_lme that is Random slop, though it is pretty close to Random intercept model. We can see that allow the gender difference in stop duration to differ between street/neighborhoods best way to predict stop durations among gender.


Intra-Class Correlation


Is stop duration mainly an individual-level or street-level thing?

stop_m0_lme <- lme(STOP_DURATION_MINUTES ~ 1, random = ~ 1| STOP_LOCATION_STREET_NAME, data = sq, method = "ML")
summary(stop_m0_lme)
## Linear mixed-effects model fit by maximum likelihood
##  Data: sq 
##        AIC      BIC    logLik
##   45968.12 45987.97 -22981.06
## 
## Random effects:
##  Formula: ~1 | STOP_LOCATION_STREET_NAME
##         (Intercept) Residual
## StdDev:    3.112088 15.24888
## 
## Fixed effects: STOP_DURATION_MINUTES ~ 1 
##                Value Std.Error   DF  t-value p-value
## (Intercept) 11.17244 0.2955366 5263 37.80392       0
## 
## Standardized Within-Group Residuals:
##        Min         Q1        Med         Q3        Max 
## -1.4360613 -0.4248297 -0.2466531  0.1203523 25.9102166 
## 
## Number of Observations: 5528
## Number of Groups: 265


ICC<-3.112088/(3.112088+15.24888)
ICC
## [1] 0.1694948

Intra-class correlation (ICC) value of .1696 suggests that around 17% of the variation in suspect stop duration by police can be attributed to different streets/neighborhoods. Therefore, I can say that the majority of the variation is due to individual-level differences and I can say that with confidence interval of 95% as we see below.


Confidence Interval


intervals(stop_m0_lme)
## Approximate 95% confidence intervals
## 
##  Fixed effects:
##                lower     est.    upper
## (Intercept) 10.59312 11.17244 11.75176
## attr(,"label")
## [1] "Fixed effects:"
## 
##  Random Effects:
##   Level: STOP_LOCATION_STREET_NAME 
##                    lower     est.    upper
## sd((Intercept)) 2.553249 3.112088 3.793243
## 
##  Within-group standard error:
##    lower     est.    upper 
## 14.96082 15.24888 15.54248

Final Thoughts


After analyzing all the models, I can say that my hypnotist is right that there is deference in terms of police stope duration among gender and as well as statistically significant in different streets. The difference is more significant in individual level than the street level. In future, we might want to take a bigger sample by combining multiple year data.