The objective here is to perform exploratory data analysis on the NYC Civilian Complaint Review Board (CCRB) dataset and to identify patterns and trends in it through a series of data visualizations.
library(ggplot2)
ccrb<-read.csv("ccrb_datatransparencyinitiative.csv")
attach(ccrb)
It will be good to get an overall sense of the spread of the data over the years to understand the peaks and troughs.
Complaints received were highest in the years 2006 to 2010, peaking in 2007. Since 2007, the number of complaints has fallen every year. It will be interesting to examine the trends of other variables to see what may have caused the peak in 2006 and 2007.
histogram <- hist(Received.Year, col="pink", xlab="Year",
main="Number of complaints received by year")
xfit<-seq(min(Received.Year),max(Received.Year),length=50)
yfit<-dnorm(xfit,mean=mean(Received.Year),sd=sd(Received.Year))
yfit <- yfit*diff(histogram$mids[1:2])*length(Received.Year)
lines(xfit, yfit, col="black", lwd=2)
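The number of complaints by incident year is very similar to the number of complaints received per year, which suggests that most complaints are filed in the same year the incident occurred. Because the dataset contains only reported incidents, it cannot tell us how many incidents go unreported.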
par(mfrow=c(2,1))
Num_Incidents <- table(Incident.Year)
plot(Num_Incidents, type = "o", xlab="Year of Incident", ylab="Count", main="Number of incidents occurred")
Num_Complaints_Received <- table(Received.Year)
plot(Num_Complaints_Received, type = "o", xlab="Year Received", ylab="Count", main="Number of complaints received")
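One way to check how closely the two series track each other is to tabulate the lag between the incident year and the year the complaint was received. The sketch below is only an illustration: `toy` is a made-up data frame with invented values that reuses the dataset's column names, and with the real data `ccrb` would take its place.

```r
# Toy stand-in for ccrb; the values are invented for illustration only.
toy <- data.frame(
  Incident.Year = c(2005, 2006, 2006, 2007, 2007, 2008),
  Received.Year = c(2006, 2006, 2006, 2007, 2008, 2008)
)

# Years elapsed between the incident and receipt of the complaint.
lag_years <- toy$Received.Year - toy$Incident.Year

# Counts per lag; a lag of 0 means the complaint was received in the
# same calendar year as the incident.
lag_table <- table(lag_years)
print(lag_table)

# Share of rows received in the same year as the incident.
same_year_share <- mean(lag_years == 0)
print(same_year_share)
```

If most of the mass sits at a lag of 0, the two yearly plots will look nearly identical, which is the pattern described above.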
The tables and bar plots below show that the dataset includes complaints received as early as 1999, while closures appear only from 2006 onward. This probably explains the peak in complaints received in 2006 and 2007.
par(mfrow=c(2,1))
Complaints_received <- table(Received.Year)
Complaints_received
## Received.Year
##  1999  2000  2002  2003  2004  2005  2006  2007  2008  2009  2010  2011
##    13     6    15    25   864 11935 23864 24384 22191 21365 17817 16454
##  2012  2013  2014  2015  2016
## 15683 14799 13658 12680  8644
barplot(Complaints_received, col = "pink", xlab="Year", ylab="Count", main="Complaints Received")
Complaints_closed <- table(Close.Year)
Complaints_closed
## Close.Year
## 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016
## 22116 24783 21430 24725 20069 17190 11421 19046 15134 15890 12593
barplot(Complaints_closed, col = "dark green", xlab="Year", ylab="Count", main="Complaints Closed")
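To compare the two series year by year, the counts can be placed side by side. The following is a minimal sketch that hard-codes the 2006 to 2016 counts from the two tables printed above, so it runs without the CSV file; the names `flow` and `Net` are introduced here purely for illustration.

```r
years <- 2006:2016

# Counts copied from the two tables printed above (2006 onward only).
received <- c(23864, 24384, 22191, 21365, 17817, 16454,
              15683, 14799, 13658, 12680, 8644)
closed   <- c(22116, 24783, 21430, 24725, 20069, 17190,
              11421, 19046, 15134, 15890, 12593)

flow <- data.frame(Year = years, Received = received, Closed = closed)

# Positive Net: more rows received than closed in that year.
flow$Net <- flow$Received - flow$Closed
print(flow)

# Grouped bars put both counts for a year next to each other.
barplot(t(as.matrix(flow[, c("Received", "Closed")])), beside = TRUE,
        names.arg = flow$Year, col = c("pink", "darkgreen"),
        legend.text = c("Received", "Closed"),
        xlab = "Year", ylab = "Count",
        main = "Received vs. closed by year")
```

With the full dataset loaded, the same data frame could be built from `table(Received.Year)` and `table(Close.Year)` instead of typed-in vectors.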
Let’s take a look at how different variables are distributed across the NYC boroughs to investigate patterns, if any, by borough.
First, let’s get an overview of how incidents are spread across the boroughs. To identify the most and least affected areas, it is important to understand the distribution of incidents across the different boroughs of NYC. From the pie chart below, it is clear that Brooklyn has the highest number of incidents while Staten Island has the fewest. This gives us a starting point for studying the factors that may be causing this spread.
Borough_Count <- table(Borough.of.Occurrence)
Pct_Borough_Count <- round(Borough_Count/sum(Borough_Count)*100)
Borough_label <- names(Borough_Count)
Borough_label <- paste(Borough_label, Pct_Borough_Count)
Borough_label <- paste(Borough_label,"%",sep="")
pie(Borough_Count, labels = Borough_label, main="Incidents across different boroughs")
Let’s now look at the distribution of full and incomplete investigations across the boroughs. This may help us understand whether investigation type is one of the factors contributing to the higher or lower occurrence of incidents in a borough. In the bar graph below, the split looks fairly similar across the boroughs, with more incidents not fully investigated than fully investigated. This initial analysis suggests that investigation type does not affect the occurrence of incidents in the boroughs.
Borough <- unique(Borough.of.Occurrence)
Borough <- data.frame(Borough)
Full_Investigation_borough <- table(Is.Full.Investigation, Borough.of.Occurrence)
barplot(Full_Investigation_borough, main="Investigation Type across Boroughs",
xlab="Boroughs", ylab="Number of incidents", col = c("red","dark green"),
legend = rownames(Full_Investigation_borough))
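Because the boroughs differ a lot in total incident counts, comparing raw counts can hide differences in the mix of investigation types. One possible check is to convert each borough's column to shares with `prop.table`. The sketch below uses a hypothetical `toy` data frame with invented values in place of `ccrb`.

```r
# Toy stand-in for ccrb; values are invented for illustration only.
toy <- data.frame(
  Is.Full.Investigation = c(TRUE, FALSE, FALSE, TRUE,
                            FALSE, FALSE, TRUE, FALSE),
  Borough.of.Occurrence = c("Bronx", "Bronx", "Bronx", "Brooklyn",
                            "Brooklyn", "Queens", "Queens", "Queens")
)

counts <- table(toy$Is.Full.Investigation, toy$Borough.of.Occurrence)

# margin = 2 normalizes each column, so every borough sums to 1.
shares <- prop.table(counts, margin = 2)
print(round(shares, 2))

# Stacked bars of equal height make the shares directly comparable.
barplot(shares, col = c("red", "darkgreen"),
        legend.text = rownames(shares),
        xlab = "Boroughs", ylab = "Share of incidents",
        main = "Investigation type as a share of each borough")
```

If the shares are close across boroughs in the real data, that would support the reading that investigation type is distributed fairly evenly.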
An assessment of encounter outcomes may provide some clues about the factors affecting the occurrence of incidents across the boroughs. Brooklyn and the Bronx have higher arrest counts but also a high number of encounters with no arrest or summons. Manhattan has more encounters with no arrest or summons than arrests. These observations suggest that stricter action, such as more arrests, could perhaps help reduce the occurrence of incidents in these boroughs.
Encounter_outcome <- table(Encounter.Outcome, Borough.of.Occurrence)
barplot(Encounter_outcome, main="Encounter outcome by Boroughs",
xlab="Boroughs", ylab="Number of incidents", col=c("dark blue","light green","cyan", "maroon"),
legend = rownames(Encounter_outcome), beside=TRUE)
Phone is the most commonly used mode of filing a complaint in every borough, followed by the call processing system. This does not tell us much about what drives the occurrence of incidents in different boroughs.
Complaint.Filed.Mode_Borough <- table(Complaint.Filed.Mode, Borough.of.Occurrence)
barplot(Complaint.Filed.Mode_Borough, main="Mode of filing complaint by Boroughs",
xlab="Boroughs", ylab="Number of complaints", col=c("yellow","light green","red", "maroon", "cyan", "gray", "blue"),
legend = rownames(Complaint.Filed.Mode_Borough), beside=TRUE)
From the graph, it is evident that Street/Highway is the most common incident location in every borough. Apartment/House is the next most common location, especially in Brooklyn.
ggplot(ccrb, aes(Borough.of.Occurrence, fill=Incident.Location)) +
geom_bar() + coord_flip() +
labs(x="Boroughs", y="Count")
To see which types of allegations are reported most and least often, we can plot a bar graph of the allegation types in the dataset. The plot below indicates that “Abuse of Authority” is the most frequently reported allegation type.
Allegation_type <- table(ccrb$Allegation.FADO.Type)
Allegation_type
##
## Abuse of Authority        Discourtesy              Force
##             102173              34452              61761
## Offensive Language
##               6008
options(scipen = 999)
barplot(Allegation_type, xlab="Type of Allegation", ylab="Count", col="orange")
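Raw counts can be easier to interpret as shares of the total. The short sketch below recomputes percentages from the counts printed above, so it does not need the CSV file; the variable names are chosen here for illustration.

```r
# Counts copied from the table printed above.
allegation_counts <- c("Abuse of Authority" = 102173,
                       "Discourtesy"        = 34452,
                       "Force"              = 61761,
                       "Offensive Language" = 6008)

# Percentage share of each type, sorted from most to least reported.
allegation_pct <- sort(100 * allegation_counts / sum(allegation_counts),
                       decreasing = TRUE)
print(round(allegation_pct, 1))
```

With the dataset loaded, `prop.table(table(ccrb$Allegation.FADO.Type))` would give the same shares as fractions.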
It will be helpful to understand when video evidence was available, to see whether it could have contributed to the number of complaints received and/or closed in a given year. From the graph below, it can be seen that video evidence is available only from 2010 onwards, with most of it concentrated between 2013 and 2015. This may partly explain the lower number of complaints received in those years, since video surveillance could have discouraged offenders.
boxplot(Incident.Year~Complaint.Has.Video.Evidence,
xlab = "Video Evidence",
ylab = "Incident Year")
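A box plot summarizes the spread of years but does not show how many rows have video evidence in each year. A cross-tabulation is one way to see that directly. The sketch below uses a hypothetical `toy` data frame with invented values; with the real data, `ccrb` would take its place, and the evidence column may be stored as text rather than as a logical.

```r
# Toy stand-in for ccrb; values are invented for illustration only.
toy <- data.frame(
  Incident.Year = c(2009, 2010, 2013, 2013, 2014, 2014, 2015, 2015),
  Complaint.Has.Video.Evidence = c(FALSE, TRUE, FALSE, TRUE,
                                   TRUE, FALSE, TRUE, TRUE)
)

# Rows are years, columns are FALSE/TRUE for video evidence.
video_by_year <- table(toy$Incident.Year, toy$Complaint.Has.Video.Evidence)
print(video_by_year)

# margin = 1 normalizes each row: the fraction of each year's rows
# with and without video evidence.
video_share <- prop.table(video_by_year, margin = 1)
print(round(video_share, 2))
```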