Objective

The objective is to perform exploratory data analysis on the NYC Civilian Complaint Review Board (CCRB) dataset and to identify patterns and trends in it through data visualizations.

Load Data

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 3.4.2

ccrb<-read.csv("ccrb_datatransparencyinitiative.csv")
attach(ccrb)

SPREAD OF DATA OVER THE YEARS

It will be good to get an overall sense of the spread of the data over the years to understand the peaks and troughs.

Number of complaints received by year

The years 2006 to 2010 saw the highest numbers of complaints received, with the peak in 2007. Since then, the number of complaints received has declined every year. It will be interesting to examine the trends of other variables to see what may have caused the peak in 2006 and 2007.

histogram <- hist(Received.Year, col="pink", xlab="Year",
     main="Number of complaints received by year") 
xfit<-seq(min(Received.Year),max(Received.Year),length=50)
yfit<-dnorm(xfit,mean=mean(Received.Year),sd=sd(Received.Year))
yfit <- yfit*diff(histogram$mids[1:2])*length(Received.Year)
lines(xfit, yfit, col="black", lwd=2)
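The scaling step in the code above can be read as follows: dnorm() returns a density whose area is 1, while the bars of a count histogram have a total area of (bin width x number of observations), so the density is multiplied by that factor before it is overlaid. A minimal sketch on simulated data (the years vector is hypothetical and is not the CCRB data):

```r
# Hypothetical stand-in for Received.Year; NOT the CCRB data.
set.seed(1)
years <- sample(2005:2016, size = 1000, replace = TRUE)

# Compute the histogram without drawing it.
h <- hist(years, plot = FALSE)

# Width of one bin, taken from the distance between the first two midpoints
# (valid because hist() uses equal-width bins by default).
bin_width <- diff(h$mids[1:2])

# Total area of the count histogram: sum of (count x width) over all bins.
total_bar_area <- sum(h$counts * diff(h$breaks))

# Normal density rescaled so that it covers the same area as the bars.
xfit <- seq(min(years), max(years), length.out = 50)
yfit <- dnorm(xfit, mean = mean(years), sd = sd(years)) * bin_width * length(years)
```

With this factor, the curve and the bars are drawn on the same vertical scale (counts), which is why the two can be compared directly in one plot.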

Are there many unreported incidents?

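The number of complaints by incident year closely tracks the number of complaints received per year, which suggests that most incidents are reported in the year they occur. Note that the dataset contains only reported incidents, so it cannot show directly how many incidents go unreported.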

par(mfrow=c(2,1))

Num_Incidents <- table(Incident.Year)
plot(Num_Incidents, type = "o", xlab="Year of Incident", ylab="Count", main="Number of incidents occurred")

Num_Complaints_Received <- table(Received.Year)
plot(Num_Complaints_Received, type = "o", xlab="Year Received", ylab="Count", main="Number of complaints received")
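Comparing two line charts by eye is approximate. One way to check the agreement per record is to compute the gap between the year received and the year of the incident. This is a minimal sketch on a hypothetical toy data frame; only the column names Incident.Year and Received.Year come from the dataset, and Reporting.Lag is an illustrative name.

```r
# Hypothetical toy records; the column names mirror the CCRB file.
toy <- data.frame(
  Incident.Year = c(2005, 2006, 2006, 2007, 2007, 2008),
  Received.Year = c(2006, 2006, 2006, 2007, 2008, 2008)
)

# Years between the incident and the receipt of the complaint.
toy$Reporting.Lag <- toy$Received.Year - toy$Incident.Year

# How many records fall at each lag, and the share received in the same year.
lag_counts <- table(toy$Reporting.Lag)
same_year_share <- mean(toy$Reporting.Lag == 0)

print(lag_counts)
print(same_year_share)
```

On the real data the same idea would be table(ccrb$Received.Year - ccrb$Incident.Year); a distribution concentrated at 0 would support the reading that complaints are filed soon after the incident.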

Number of complaints closed vs. received

The tables and bar charts below show that the CCRB data contains complaints received from 1999 onward, while closures are recorded only from 2006. This probably explains the peak in complaints received in 2006 and 2007.

par(mfrow=c(2,1))
Complaints_received <- table(Received.Year)
Complaints_received

## Received.Year
##  1999  2000  2002  2003  2004  2005  2006  2007  2008  2009  2010  2011 
##    13     6    15    25   864 11935 23864 24384 22191 21365 17817 16454 
##  2012  2013  2014  2015  2016 
## 15683 14799 13658 12680  8644

barplot(Complaints_received, col = "pink", xlab="Year", ylab="Count", main="Complaints Received")

Complaints_closed <- table(Close.Year)
Complaints_closed

## Close.Year
##  2006  2007  2008  2009  2010  2011  2012  2013  2014  2015  2016 
## 22116 24783 21430 24725 20069 17190 11421 19046 15134 15890 12593

barplot(Complaints_closed, col = "dark green", xlab="Year", ylab="Count", main="Complaints Closed")
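The two printed tables can also be combined to see how many complaints in the dataset were still open at the end of each year. The sketch below re-enters the counts from the output above by hand; the names received, closed, fill_years and open_at_year_end are illustrative. Both tables sum to the same total, which suggests that every record in the file has a close year.

```r
# Counts copied from the printed tables above (there is no 2001 entry).
received <- setNames(
  c(13, 6, 15, 25, 864, 11935, 23864, 24384, 22191, 21365,
    17817, 16454, 15683, 14799, 13658, 12680, 8644),
  c(1999, 2000, 2002:2016)
)
closed <- setNames(
  c(22116, 24783, 21430, 24725, 20069, 17190,
    11421, 19046, 15134, 15890, 12593),
  2006:2016
)

# Put both series on the same year axis, using 0 for years with no entry.
years <- as.character(1999:2016)
fill_years <- function(x) {
  out <- setNames(numeric(length(years)), years)
  out[names(x)] <- x
  out
}

# Running total received minus running total closed = still open at year end.
open_at_year_end <- cumsum(fill_years(received)) - cumsum(fill_years(closed))
print(open_at_year_end)
```

A value of zero in the final year would mean the file holds only complaints that have been closed, which is worth keeping in mind when interpreting the received counts for recent years.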

ANALYSIS OF DATA BY BOROUGHS IN NYC

Let’s take a look at the spread or distribution of different data variables across the NYC Boroughs to investigate patterns, if any, by region/boroughs.

What is the distribution of all incidents across different boroughs in NYC?

First, let’s get an overview of how incidents are distributed across the boroughs, since this shows which areas are most and least affected. From the pie chart below, it is clear that Brooklyn has the highest number of incidents while Staten Island has the fewest. This gives us a basis for studying the factors that may be causing this spread.

Borough_Count <- table(Borough.of.Occurrence)
Pct_Borough_Count <- round(Borough_Count/sum(Borough_Count)*100)
Borough_label <- names(Borough_Count)
Borough_label <- paste(Borough_label, Pct_Borough_Count)
Borough_label <- paste(Borough_label,"%",sep="")

pie(Borough_Count, labels = Borough_label, main="Incidents across different boroughs")

Is there a pattern to how incidents are investigated across different boroughs?

Let’s now see the distribution of full and incomplete investigations in the different boroughs. This may help us understand whether investigation type is one of the factors contributing to the higher or lower occurrence of incidents in the boroughs. From the bar graph below, the distribution seems to be fairly equal across the boroughs, with more incidents not investigated fully. This initial analysis suggests that the investigation type does not affect the occurrence of incidents in the boroughs.

# Cross-tabulate investigation type (rows) against borough (columns)
Full_Investigation_borough <- table(Is.Full.Investigation, Borough.of.Occurrence)

barplot(Full_Investigation_borough, main="Investigation Type across Boroughs",
        xlab="Boroughs", ylab="Number of incidents", col = c("red","dark green"),
        legend = rownames(Full_Investigation_borough))
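Because the boroughs differ a lot in total incident count, a stacked bar of raw counts makes it hard to judge whether the split is really similar everywhere. Column-wise proportions make that comparison direct. This is a minimal sketch on hypothetical toy data; it assumes Is.Full.Investigation is read as a logical column, which may not hold for the actual file.

```r
# Hypothetical toy records; the column names mirror the CCRB file.
toy <- data.frame(
  Is.Full.Investigation = c(TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE),
  Borough.of.Occurrence = c("Bronx", "Bronx", "Bronx",
                            "Brooklyn", "Brooklyn", "Brooklyn",
                            "Queens", "Queens")
)

# Counts of investigation type (rows) by borough (columns).
counts <- table(toy$Is.Full.Investigation, toy$Borough.of.Occurrence)

# margin = 2 divides each column by its own total, so every borough sums to 1.
shares <- prop.table(counts, margin = 2)
print(round(shares, 2))

# The shares can be plotted the same way as the counts, for example:
# barplot(shares, xlab = "Boroughs", ylab = "Share of incidents",
#         legend = rownames(shares))
```

If the share of full investigations is about the same in every column, that supports the observation that investigation type is distributed evenly across boroughs.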

What are the encounter outcomes in different boroughs?

An assessment of the encounter outcomes may provide some clues as to the factors affecting the occurrence of incidents across the boroughs. Brooklyn and the Bronx have more arrests than the other boroughs but also have a high number of encounters with no arrest or summons. Manhattan has more encounters with no arrest or summons than arrests. These observations suggest that more stringent actions, such as more arrests, could perhaps be taken to try to reduce the occurrence of incidents in these regions.

Encounter_outcome <- table(Encounter.Outcome, Borough.of.Occurrence)

# The y-axis shows counts; the outcome categories are given by the legend
barplot(Encounter_outcome, main="Encounter outcome by Boroughs",
        xlab="Boroughs", ylab="Number of incidents", col=c("dark blue","light green","cyan", "maroon"),
        legend = rownames(Encounter_outcome), beside=TRUE)

What are the most common modes of filing a complaint in different boroughs?

Phone seems to be the most commonly used mode of filing a complaint across all boroughs, followed by the call processing system. This does not provide much information for assessing any effect on the occurrence of incidents in different boroughs.

Complaint.Filed.Mode_Borough <- table(Complaint.Filed.Mode, Borough.of.Occurrence)

# The y-axis shows counts; the filing modes are given by the legend
barplot(Complaint.Filed.Mode_Borough, main="Mode of filing complaint by Boroughs",
        xlab="Boroughs", ylab="Number of complaints", col=c("yellow","light green","red", "maroon", "cyan", "gray", "blue"),
        legend = rownames(Complaint.Filed.Mode_Borough), beside=TRUE)

What are the most common incident locations in different boroughs?

From the graph, it is evident that Street/Highway is the most common incident location across all boroughs. Apartment/House is the next most common location, especially in Brooklyn.

ggplot(ccrb, aes(Borough.of.Occurrence, fill=Incident.Location)) + 
  geom_bar() + coord_flip() + 
  labs(x="Boroughs", y="Count")

ANALYSIS OF ALLEGATIONS AND EVIDENCE

Types of Allegations

To see which types of allegations are reported most and least often, we can tabulate them and plot a bar graph. The table and plot below indicate that “Abuse of Authority” is the most reported allegation type and “Offensive Language” the least.

Allegation_type <- table(ccrb$Allegation.FADO.Type)
Allegation_type

## 
## Abuse of Authority        Discourtesy              Force 
##             102173              34452              61761 
## Offensive Language 
##               6008

options(scipen = 999)

barplot(Allegation_type, xlab="Type of Allegation", ylab="Count", col="orange")

Video Evidence by Incident Year

It will be helpful to understand when video evidence was available to see whether that evidence could have contributed to the number of complaints received and/or closed that year. From the graph below, it can be seen that video evidence is available only from 2010 onwards, with most of it concentrated between 2013 and 2015. This may partly explain the lower number of complaints received in those years, as video surveillance could have discouraged offenders.

boxplot(Incident.Year~Complaint.Has.Video.Evidence,
        xlab = "Video Evidence",
        ylab = "Incident Year")
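A boxplot summarizes the incident years in each video group but does not show what share of each year's complaints has video. A cross-tabulation by year gives that directly. This is a minimal sketch on hypothetical toy data; it assumes Complaint.Has.Video.Evidence is read as a logical column, which may not hold for the actual file.

```r
# Hypothetical toy records; the column names mirror the CCRB file.
toy <- data.frame(
  Incident.Year = c(2009, 2010, 2013, 2013, 2014, 2014, 2015, 2015),
  Complaint.Has.Video.Evidence = c(FALSE, TRUE, FALSE, TRUE,
                                   TRUE, FALSE, TRUE, TRUE)
)

# Counts of complaints with and without video evidence for each incident year.
video_by_year <- table(toy$Incident.Year, toy$Complaint.Has.Video.Evidence)

# margin = 1 divides each row by its own total; keep the share with video.
video_share <- prop.table(video_by_year, margin = 1)[, "TRUE"]
print(video_share)
```

Plotting video_share against the year would show when video evidence became common, which can then be compared with the yearly complaint counts.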

Summary

Complaint counts by incident year closely track counts by year received, which suggests that most incidents are reported in the year they occur
There was a peak in complaints received in 2006 and 2007. One explanation is that complaints were received from 1999 but closures are recorded only from 2006, which would have led to an accumulation of complaints
Brooklyn, the Bronx and Manhattan have the highest number of incidents
Street/Highway and Apartment/House are the most common incident locations across all boroughs
“Abuse of Authority” is the most reported allegation type

Exploratory Data Analysis of NYC Data Transparency Civilian Complaint Review Board (CCRB) data

Swetha Prasad

January 21, 2018