SOC 712 | Home Work 9 | Exploratory & Model Analysis Through Visualization

Md Afzal Hossain

April 15, 2019

Does Racial Bias Exist in Stop & Frisk:

Using Stop and Frisk data, I am interested to see that is there any pattern the way NYPD stopping people or do they target any spacefic race or neighborhood. Furthermore, I will explore the contributing factor of getting stop and arrested by visualizing different predictive model. I am expected to see pattern, neighborhood and potential racial bias.

About The Data:

I used 2018 data for explotory analysis and buildilding model. Since 2018 data does not have location data (Latitude and Longitude) for ploting maps, i used 2011 data prepared by Columbia University to vizualize location.This is not ideal, but at least it will give us some idea, since NYPD follow same protocol.Every time a police officer stops a person in \(NYC\), the officer is require to fill out a form recording the details of the stop. The forms were filled out by hand and manually entered into an \(NYPD\) database. When the forms became electronic, database are available for download for public . Each row of the data set represent a stop by \(NYPD\) officer. Dataset also contain variable such as the age, race, weight, height of the person stopped, if a person was frisked (when an officer pass the hands over someone in a search for hidden weapons, drugs, or other items and the location of the stop.

Data Import and Preparation:

Stop And Frisk-2018 Data:

# Reading the excel data
sq<- read_excel("C:/Users/Afzal Hossain/Downloads/sqf-2018.xlsx")
# Lets look the Data
dim(sq)
## [1] 11008    83

Which Borough Has The Highest Stop Number:

Boro<-sq %>% group_by(STOP_LOCATION_BORO_NAME) %>% 
                 summarise(num_stops = n()) %>% 
                    mutate(perc_stop = paste0(round(100 * num_stops/sum(num_stops), 0), "%"))


ggplot(Boro, aes(x=reorder(STOP_LOCATION_BORO_NAME,-num_stops),y=num_stops,fill = perc_stop))+
       geom_bar( stat="identity",fill="plum3")+xlab("Borough ")+
          ylab("Number of Stops ")+ggtitle("Number of Stops and Percentage in Different Borough\n") +
            geom_text(aes(label = perc_stop), size = 5,color="black")+
              theme(panel.background = element_rect(fill = "aliceblue",colour = "aliceblue"), 
                    plot.title = element_text(hjust = 0.5,size = 15, face = "bold"))

Time Series Analysis:

Stop and Arrests-Month:

mont1<-sq %>% group_by(MONTH2) %>% 
                 summarise(num_stops = n(),p_arr=mean(arrasted)) %>% 
                    mutate(perc_stop = paste0(round(100 * num_stops/sum(num_stops), 0), "%")) 

# Lable Based on the group by result
mn<-c("Apr", "Aug", "Dec", "Feb", "Jan", "July",'June',"Mar", "May", "Nov","Oct","Sep")

p1<-ggplot(mont1, aes(x=reorder(MONTH2,-num_stops),y=num_stops,fill = perc_stop))+
      geom_bar( stat="identity",fill="plum3")+xlab("Month ")+ylab("Number of Stops ")+
         ggtitle("Number & Percentage of Stops (Month)") +
          geom_text(aes(label = perc_stop), size = 4.5,color="black")+
           theme(panel.background = element_rect(fill = "aliceblue",colour = "aliceblue"),
                 plot.title = element_text(hjust = 0.5,size = 15, face = "bold"),legend.position="none")+scale_x_discrete(labels= mn)

 
p2<-ggplot(mont1, aes(x=MONTH2,y=p_arr,group=1))+geom_line(size = 1.2,color="plum3")+
       geom_point(color="darkcyan",size=2)+xlab("Month ")+
          ylab("Percentage of Arrests ")+ggtitle("Percentage of Arrest(Month)")+
           theme(panel.background = element_rect(fill = "aliceblue",colour = "aliceblue"),
              plot.title = element_text(hjust = 0.5,size = 15, face = "bold"))+
                    scale_x_discrete(labels= mn)+ scale_y_continuous(labels = scales::percent)

grid.arrange(p1, p2, ncol = 2)

Its seems like NYPD already decided how many people they will stop every month to fulfil a specific number which is an allegation against them. Percentage of arrest is different because sometime they get lucky.

Stop and Arrasts-Days:

Percentage of Arrast:
day<-sq %>% group_by(DAY2) %>% 
                 summarise(p_arr=mean(arrasted),num_stops = n()) %>% 
                    mutate(perc_stop = paste0(round(100 * num_stops/sum(num_stops), 0), "%"))           

ggplot(day, aes(x=DAY2,y=p_arr,group=1))+geom_line(size = 1.2,color="plum3")+
       geom_point(color="darkcyan",size=2)+xlab("DAY ")+
           ylab("Percentage of Arrests ")+ggtitle("Percentage of Arrast (Day)")  +
             theme(panel.background = element_rect(fill = "aliceblue",colour = "aliceblue"),
                   plot.title = element_text(hjust = 0.5,size = 15, face = "bold"))+ scale_y_continuous(labels = scales::percent)

Number of Stop and Arrast:
day1<-sq %>% group_by(DAY2) %>% 
                 summarise(Number.Arrasr =sum(arrasted),Number.Stops = n()) 
test_data_long <- melt(day1, id="DAY2") 

ggplot(data=test_data_long,
       aes(x=DAY2, y=value,fill=variable)) +
       geom_bar(stat="identity")+xlab("DAY")+ylab("Number of Stops ")+ggtitle("Number of Stop & Arrast (DAY)") +geom_text(aes(label = value), size = 3,vjust = 2,color="BLACK",position = position_stack())+theme(panel.background = element_rect(fill = "aliceblue",colour = "aliceblue"), plot.title = element_text(hjust = 0.5,size = 15, face = "bold"))+scale_fill_manual(values = c('darkcyan', 'plum3'))

Stop Number and Distribution-Age:

age1<-sq%>%
  group_by(SUSPECT_REPORTED_AGE)%>%
  summarise(num_stops = n())
  

pa<-ggplot(data=age1, aes(x=SUSPECT_REPORTED_AGE,y=num_stops))+geom_line(size = 1.2,color="plum3")+xlab("AGE ")+ylab("Number of Stop ")+ggtitle("Number of Stop in  Different AGE\n")  +theme(panel.background = element_rect(fill = "aliceblue",colour = "aliceblue"), plot.title = element_text(hjust = 0.5,size = 15, face = "bold"))

sq<-sq%>%filter(!is.na(SUSPECT_RACE_DESCRIPTION))
pd<-ggplot(sq, aes(x = as.numeric(SUSPECT_REPORTED_AGE), fill =SUSPECT_RACE_DESCRIPTION)) +
      geom_density(alpha = 0.5)+facet_wrap(~SUSPECT_SEX)+
          theme(panel.background = element_rect(fill = "aliceblue",colour = "aliceblue"),
             plot.title = element_text(hjust = 0.5,size = 15, face = "bold"))+
                ggtitle("Distribution of AGE \n")+xlab("AGE")+scale_fill_manual(name="RACE",
                    labels = c("BLACK", "LATINO", "OTHER", "WHITE"), 
                    values = c("BLACK" ="#00ba38", "LATINO"="#f8766d","OTHER"="lightskyblue","WHITE"="plum3"))


grid.arrange(pa, pd, ncol = 2)

Stop and Arrast-Race & Gender:

Number of Stop and Arrasts:
sq<- sq %>% filter(!is.na(sq$SUSPECT_RACE_DESCRIPTION),!is.na(SUSPECT_SEX))
ggplot(sq, aes(x = as.factor(SUSPECT_RACE_DESCRIPTION), fill = factor(arrasted))) +
        geom_bar() +facet_wrap(~SUSPECT_SEX) +  ggtitle("Total Stop and Arrest (Race & Gender)") +
          xlab("Race") +ylab("Total Stops") + 
           labs(fill = "Arrasted")+theme(panel.background = element_rect(fill = "aliceblue",colour = "aliceblue"),
                plot.title = element_text(hjust = 0.5,size = 15, face = "bold"))+scale_fill_manual(values = c('darkcyan', 'plum3'))

Arrested Percentage and Number of Stop Without Gender:

race1<-sq %>% group_by(SUSPECT_RACE_DESCRIPTION) %>% 
                 summarise(p_arr=mean(arrasted),num_stops = n()) 

rp<-ggplot(data = race1, aes(y=SUSPECT_RACE_DESCRIPTION, x=p_arr)) + 
    geom_point(aes(size=num_stops),color="darkcyan")+ scale_fill_gradient2(low = ("red"), mid = "white",high = ("blue"),
            midpoint = 0, space = "Lab",na.value = "grey50", guide = "colourbar") + coord_flip() +
                  ggtitle(" Arrested Percentage and Number of Stop of Different Race")+xlab("Parcentage of Arrested")+
                       ylab("RACE")+ theme(panel.background =   element_rect(fill = "aliceblue",colour = "aliceblue"), 
                               plot.title = element_text(hjust = 0.5,size = 15, face = "bold")) 
                          
 ggplotly(rp)                 

It is very revealing that white arrest rate is higher than black, but black were stop more.

Calculation Z-Score for Stop Duration:

sq$stop_z_score <- round((sq$STOP_DURATION_MINUTES - mean(sq$STOP_DURATION_MINUTES))/sd(sq$STOP_DURATION_MINUTES), digits=2)
sq$stop_type <- ifelse(sq$stop_z_score < 0, "below", "above")
sq <- sq[order(sq$stop_z_score), ]
Ploting Z-Score For Race and Month:
zpr<-ggplot(sq, aes(x=SUSPECT_RACE_DESCRIPTION, y=stop_z_score, label=stop_z_score)) + 
  geom_bar(stat='identity', aes(fill=stop_type), width=.8)  +
  scale_fill_manual(name="Minute", 
                    labels = c("Above Average", "Below Average"), 
                    values = c("above"="plum3", "below"="darkcyan")) + 
  labs(title= "Normalised Time from Stop Duration:Race") + xlab("RACE")+ylab("Z-score:Stop Duration")+
  coord_flip()+ theme(panel.background =   element_rect(fill = "aliceblue",colour = "aliceblue"), 
                               plot.title = element_text(hjust = 0.5,size = 15, face = "bold")) 

zpm<-ggplot(sq, aes(x=MONTH2, y=stop_z_score, label=stop_z_score)) + 
  geom_bar(stat='identity', aes(fill=stop_type), width=.8)  +
  scale_fill_manual(name="Minute", 
                    labels = c("Above Average", "Below Average"), 
                    values = c("above"="plum3", "below"="darkcyan")) + 
  labs(title= "Normalised Time from Stop Duration:Month") + xlab("MONTH")+ ylab("Z-score:Stop Duration")+
  coord_flip()+theme(panel.background =   element_rect(fill = "aliceblue",colour = "aliceblue"), 
                               plot.title = element_text(hjust = 0.5,size = 15, face = "bold")) 


grid.arrange(zpr, zpm, ncol = 2)

Stop and Arrast-Naborhood:

top_st<-sq %>% group_by(STOP_LOCATION_STREET_NAME) %>% 
                 summarise(num_stops = n(),Total.arrast=sum(arrasted)) %>%  arrange(desc(num_stops))
                                   
#Taking top ten                           
top_st<-head(top_st,10)

# Making long data
st_data_long <- melt(top_st, id="STOP_LOCATION_STREET_NAME") 


ggplot(data=st_data_long,
    aes(x=STOP_LOCATION_STREET_NAME, y=value,fill=variable)) +coord_flip()+
      geom_bar(stat='identity') +  ggtitle("Top 10 Street Targeted by NYPD") +
        ylab("Total Stop") +xlab("STREET NAME")+
            labs(fill = "Arrasted")+
                theme(panel.background = element_rect(fill = "aliceblue",colour = "aliceblue"), 
                      plot.title = element_text(hjust = 0.5,size = 12, face = "bold"))+
                       scale_fill_manual(values = c('darkcyan','plum3'))+
                          geom_text(aes(label = value), size = 4,vjust =
                                .5,hjust=1.2,color="BLACK",position = position_stack())

Street with minority(Jamaica,Fulton Street,3 street) has very low arrest proportion.

Building Logistic Model

#Model
log.m <- glm(arrasted ~ SUSPECT_SEX +SUSPECT_REPORTED_AGE + weapon_found*Frisked +SUSPECT_RACE_DESCRIPTION+OTHER_WEAPON_FLAG , family = binomial, data = sq)
library(jtools)
plot_summs(log.m, scale = TRUE, exp = TRUE,color.class = "dodgerblue3")+ theme(panel.background = element_rect(fill = "aliceblue",colour = "aliceblue"))

Predictor-Race:

effect_plot(log.m, pred=SUSPECT_RACE_DESCRIPTION, int.width = .8,colors="dodgerblue3")+ theme(panel.background = element_rect(fill = "aliceblue",colour = "aliceblue"))

Predictor-Sex:

effect_plot(log.m, pred=SUSPECT_SEX, int.width = .8,colors="dodgerblue3")+ theme(panel.background = element_rect(fill = "aliceblue",colour = "aliceblue"))

Predictor- Age:

effect_plot(log.m, pred = SUSPECT_REPORTED_AGE, interval = TRUE,colors="dodgerblue3")

Building Random Forast Model:

What is a Random Forest

A detailed study of Random Forests would take too long. However, since it’s an often used machine learning technique, it makes sense to provide with a general understanding and illustrate how to apply the technique using R.In layman terms, the Random Forest technique handles the overfitting problem you faced with decision trees. It grows multiple (very deep) classification trees . At the time of prediction, each tree is used to come up with a prediction and every outcome is counted as a vote. (DataCamp)

test<-select(sq,"arrasted","STOP_LOCATION_BORO_NAME","STOP_DURATION_MINUTES","SUSPECT_REPORTED_AGE","SUSPECT_SEX","SUSPECT_RACE_DESCRIPTION","SUSPECT_RACE_DESCRIPTION","SUSPECT_WEIGHT","weapon_found","Frisked","MONTH2")
arr_tree<- rpart(arrasted~STOP_LOCATION_BORO_NAME+STOP_DURATION_MINUTES+SUSPECT_REPORTED_AGE+SUSPECT_SEX+SUSPECT_WEIGHT+weapon_found+Frisked+SUSPECT_RACE_DESCRIPTION, data=test, method="class",control = rpart.control(minsplit =1,minbucket = 1,cp=.0008))

fancyRpartPlot(arr_tree)

Apply the Random Forest Algorithm:

library(randomForest)
train1 <- test[1:7000,]
Arrast_forest <- randomForest(arrasted~STOP_LOCATION_BORO_NAME+STOP_DURATION_MINUTES+SUSPECT_REPORTED_AGE+SUSPECT_SEX+weapon_found+Frisked+SUSPECT_WEIGHT+SUSPECT_RACE_DESCRIPTION, data=train1, importance=TRUE, ntree=1000)

Ploting the Accuracy and Gini:

varImpPlot(Arrast_forest)

The accuracy plot shows how much worst the model would perform without the included variables.So a high decrease (= high value x-axis) links to a high predictive variable. The second plot is the Gini coefficient. The higher the variable scores here, the more important it is for the model.

2011 Stop and Frisk Data For Geographic Visualization:

# Reading the excel data
sqf<- read.csv("C:/Users/Afzal Hossain/Downloads/Stops2011WithPrecinct.csv")
dim(sqf)
## [1] 660697    115

Taking Random 10000 Arrast for Felony,Weapon,Robbery & Misdemeanor:

F_A<-w_s<-sqf[ sample(which(sqf$crimsusp == "FELONY"), 10000 ), ] 
fp<-qmplot(X, Y, data =  F_A, colour = race, size = I(.5),  maptype = "toner",zoom=11 )+ ggtitle("Race & Crime:Felony")+ theme(panel.background =   element_rect(fill = "aliceblue",colour = "aliceblue"), plot.title = element_text(hjust = 0.5,size = 12, face = "bold"))

w_s<-sqf[ sample(which(sqf$crimsusp == "CPW"), 10000 ), ]
wp<-qmplot(X, Y, data = w_s, colour = race, size = I(.5),  maptype = "toner")+ ggtitle("Race & Crime:Weapon")+ theme(panel.background =   element_rect(fill = "aliceblue",colour = "aliceblue"), plot.title = element_text(hjust = 0.5,size = 12, face = "bold"))

Ro_s<-sqf[ sample(which(sqf$crimsusp == "ROBBERY"), 10000 ), ]
ro_p<-qmplot(X, Y, data =Ro_s, colour = race, size = I(.5), maptype = "toner")+ ggtitle("Race & Crime:Robbery")+ theme(panel.background =   element_rect(fill = "aliceblue",colour = "aliceblue"), plot.title = element_text(hjust = 0.5,size = 12, face = "bold"))

Mi_s<-sqf[ sample(which(sqf$crimsusp == "MISD"), 10000 ), ]
mi_p<-qmplot(X, Y, data =Mi_s, colour = race, size = I(.5), maptype = "toner")+ ggtitle("Race & Crime:Misdemeanor")+ theme(panel.background =   element_rect(fill = "aliceblue",colour = "aliceblue"), plot.title = element_text(hjust = 0.5,size = 12, face = "bold"))

Poting Felony & Weapon Arrest:

grid.arrange(fp, wp, ncol = 2)

Poting Robbery & Misdemeanor Arrest:

grid.arrange(ro_p,mi_p, ncol = 2)

We can clearly see that some neighborhood has gun problem mostly Brooklyn. Most of the arrest for felony and weapon charge were lations and African Americans. White and other race more likely to get arrest for misdemeanors.

What I Learn:

After this expletory analysis and model analysis, I can say NYPD try to fulfilling predefine number of stops by stopping fixed number of people. This might cause racial biases that indicated the low arrest rate but high stop number of black people which kind of prove my hypothesis. However, at the same time most of the successful arrest were Black race. Therefore, even though I was able to get some bias, we need more detail analysis including censes data that will give us a better idea about demography.