The objective of this assignment is to conduct an exploratory data analysis of the NYC Data Transparency Initiative, a database of complaints that fall within the jurisdiction of the Civilian Complaint Review Board (CCRB), an independent municipal agency. Our goal is to identify interesting patterns within the data that may be indicative of large-scale trends.
In this section we use the “readxl” package to read large Excel files and load the data.
library(readxl)
library(ggplot2)
ccrb_data<- read_excel("/Users/yousiyan/Downloads/ccrb_datatransparencyinitiative.xlsx",sheet = "Complaints_Allegations")
First let’s look at the distribution of incidents over years.
# Note: the same complaint can have more than one entry, so we keep unique complaint IDs to avoid counting a complaint more than once
inci.year<- unique(ccrb_data[c("UniqueComplaintId","Incident Year")])
inci.year<- data.frame(inci.year)
ggplot(inci.year,aes(Incident.Year))+geom_bar()
From the plot above, it looks like the overall number of reported incidents increased dramatically from 2005 to 2006, stayed almost steady from 2006 to 2009, and then declined smoothly from 2009 to 2016. This may reflect an actual decline in incidents over these years, or it may simply be that fewer people reported incidents.
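The deduplication step matters because a complaint can appear in more than one row, so counting raw rows would overstate the yearly totals. The following is a minimal sketch on a made-up data frame (the IDs and years are hypothetical, not taken from the CCRB file) that shows the effect and produces exact yearly counts of the kind the bar chart displays:

```r
# Toy data: hypothetical complaint IDs and years, not real CCRB records.
toy <- data.frame(
  UniqueComplaintId = c(1, 1, 2, 3, 3, 3, 4),
  Incident.Year     = c(2005, 2005, 2005, 2006, 2006, 2006, 2006)
)

# Counting raw rows overstates complaints that have several entries.
raw.counts <- table(toy$Incident.Year)

# Keep one row per (complaint, year) pair, then count.
toy.unique   <- unique(toy[c("UniqueComplaintId", "Incident.Year")])
dedup.counts <- table(toy.unique$Incident.Year)

print(raw.counts)
print(dedup.counts)
```

Applied to the real data, the same `table()` call on `inci.year$Incident.Year` would give the counts behind each bar.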
stem(inci.year$Incident.Year)
##
## The decimal point is at the |
##
## 1999 | 00
## 2000 | 0
## 2001 |
## 2002 | 000
## 2003 | 0000000
## 2004 | 00000000000000000000000000000000000000000000000000000000000000000000+122
## 2005 | 00000000000000000000000000000000000000000000000000000000000000000000+3344
## 2006 | 00000000000000000000000000000000000000000000000000000000000000000000+7618
## 2007 | 00000000000000000000000000000000000000000000000000000000000000000000+7464
## 2008 | 00000000000000000000000000000000000000000000000000000000000000000000+7263
## 2009 | 00000000000000000000000000000000000000000000000000000000000000000000+7549
## 2010 | 00000000000000000000000000000000000000000000000000000000000000000000+6381
## 2011 | 00000000000000000000000000000000000000000000000000000000000000000000+5932
## 2012 | 00000000000000000000000000000000000000000000000000000000000000000000+5675
## 2013 | 00000000000000000000000000000000000000000000000000000000000000000000+5330
## 2014 | 00000000000000000000000000000000000000000000000000000000000000000000+4670
## 2015 | 00000000000000000000000000000000000000000000000000000000000000000000+4322
## 2016 | 00000000000000000000000000000000000000000000000000000000000000000000+2769
From the stem-and-leaf plot above we can see that incidents peaked in 2006 and decreased from 2009 to 2016.
Now let’s look at the distribution of incidents over different areas in NYC.
area.year<- unique(ccrb_data[c("UniqueComplaintId","Incident Year","Borough of Occurrence")])
area.year<- data.frame(area.year)
ggplot(area.year,aes(Incident.Year,fill=Borough.of.Occurrence))+geom_bar()
From the plot above, it looks like incidents decreased evenly over the years in all the boroughs. Staten Island has the fewest incidents and Brooklyn has the most.
ggplot(area.year,aes(Incident.Year,color=Borough.of.Occurrence))+geom_freqpoly(binwidth=1)
From the plot above, we can see that almost all boroughs had a tremendous increase from 2004 to 2005 and all started to decrease after 2006; by 2016 the boroughs have almost the same number of incidents. Brooklyn has the highest incident counts, Manhattan and the Bronx are almost the same, Queens comes next, and Staten Island has the lowest.
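The borough ranking read off the plots can also be checked numerically with a year-by-borough cross-tabulation. The sketch below uses invented rows (hypothetical values, assumed to be already deduplicated) purely to illustrate the technique:

```r
# Toy data: hypothetical values, one row per complaint.
toy <- data.frame(
  Incident.Year = c(2006, 2006, 2006, 2016, 2016, 2016, 2016),
  Borough.of.Occurrence = c("Brooklyn", "Brooklyn", "Queens",
                            "Brooklyn", "Queens", "Bronx", "Bronx")
)

# Rows are years, columns are boroughs, cells are complaint counts.
borough.by.year <- with(toy, table(Incident.Year, Borough.of.Occurrence))
print(borough.by.year)

# Total complaints per borough across all years, largest first.
borough.totals <- sort(colSums(borough.by.year), decreasing = TRUE)
print(borough.totals)
```

The same two calls on `area.year` would give the per-borough totals and the counts for any single year, such as 2016.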
Now let’s look at the location of incidents over different areas in NYC.
loc.area.year<- unique(ccrb_data[c("UniqueComplaintId","Incident Year","Incident Location")])
loc.area.year<- as.data.frame(table(loc.area.year$`Incident Year`,loc.area.year$`Incident Location`))
ggplot(loc.area.year,aes(Var1,Freq,color=Var2))+geom_point()
From the plot above, it looks like incidents reported at Street/highway locations changed the most over the years: the frequency was higher from 2006 to 2009 and then steadily decreased from 2010 to 2016. Apartment/house shows a slight increase from 2006 to 2009 and remains steady from 2010 until 2015, whereas the other locations do not show changes of this size.
Now let’s look at mode of reporting incidents and whether there is a preference for one method over the other.
mode.year<- data.frame(unique(ccrb_data[c("UniqueComplaintId","Incident Year","Complaint Filed Mode")]))
# "colors" is not a ggplot2 aesthetic; "fill" colors the bars by filing mode
ggplot(mode.year,aes(Complaint.Filed.Mode,fill = Complaint.Filed.Mode))+geom_bar()
From the above plot, it looks like phone is the most used reporting mode, followed by the Call Processing System and then the online website.
mode.year<- as.data.frame(with(mode.year,table(Incident.Year,Complaint.Filed.Mode)))
ggplot(mode.year,aes(Incident.Year,Freq,color=Complaint.Filed.Mode))+geom_point()
From the above plot, we can see that in more recent years online website reporting has increased compared to previous years, whereas the phone and call processing system modes have decreased. Phone reporting decreased tremendously from 2009 to 2016.
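Because the total number of complaints also falls after 2009, a falling count for one mode does not by itself show that complainants shifted between modes. One way to separate the two effects is to look at each mode's share within a year. This is a sketch under assumptions: the rows and the mode labels below are hypothetical, not values from the CCRB file.

```r
# Toy data: hypothetical values, one row per complaint.
toy <- data.frame(
  Incident.Year = c(2009, 2009, 2009, 2009, 2016, 2016),
  Complaint.Filed.Mode = c("Phone", "Phone", "Phone", "Online website",
                           "Phone", "Online website")
)

mode.counts <- with(toy, table(Incident.Year, Complaint.Filed.Mode))

# margin = 1 divides each row by its row total, giving per-year shares.
mode.share <- prop.table(mode.counts, margin = 1)
print(round(mode.share, 2))
```

Each row of `mode.share` sums to 1, so a rising value in one column means that mode accounts for a larger fraction of that year's complaints regardless of the overall volume.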
Now let’s look at the reasons for initial contact of incident reporting.
reason.year<- data.frame(unique(ccrb_data[c("UniqueComplaintId","Incident Year","Reason For Initial Contact")]))
# Named reason.counts rather than "order" to avoid masking base::order()
reason.counts<- data.frame(sort(table(reason.year$Reason.For.Initial.Contact),decreasing = TRUE))
ggplot(reason.counts[1:15,],aes(Var1,Freq))+geom_point()+coord_flip()
From the above plot, “P/D suspected C/V of Violation/Crime - Street” is the No. 1 reason for initial contact. “Other” follows as the second; none of the remaining reasons reaches a comparably high frequency.
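The ranking step used above (tabulate, sort in decreasing order, keep the first rows) can be seen in isolation on a small vector. The reason labels here are invented for illustration and are not the categories in the CCRB data:

```r
# Toy data: hypothetical reason labels, one element per complaint.
toy.reasons <- c("Other", "Traffic stop", "Other", "Street stop",
                 "Street stop", "Street stop")

# Tabulate, sort from most to least frequent, and keep the top two.
toy.counts <- sort(table(toy.reasons), decreasing = TRUE)
top.two    <- head(toy.counts, 2)

# Build a plain data frame with explicit column names for plotting.
top.df <- data.frame(reason = names(top.two), count = as.vector(top.two))
print(top.df)
```

Using `head(x, n)` instead of `x[1:n, ]` also behaves sensibly when there are fewer than `n` categories, since it returns only the rows that exist.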
Now let’s look at Encounter outcomes for incidents.
outcome.year<- data.frame(unique(ccrb_data[c("UniqueComplaintId","Incident Year","Reason For Initial Contact","Encounter Outcome")]))
# Named outcome.counts rather than "order" to avoid masking base::order()
outcome.counts<- data.frame(sort(table(outcome.year$Encounter.Outcome),decreasing = TRUE))
ggplot(outcome.counts[1:4,],aes(Var1,Freq))+geom_point()
The majority of complaints fall under “No Arrests or Summons”. The second most common outcome is “Arrest”, which we explore further in the next plot.
Now let’s look at encounter outcomes for incidents and their relation to reasons for initial contact.
reasons<- data.frame(sort(table(outcome.year$Reason.For.Initial.Contact),decreasing = TRUE))
outcome.year<- as.data.frame(outcome.year[outcome.year$Reason.For.Initial.Contact %in% reasons$Var1[1:5],])
ggplot(outcome.year,aes(Reason.For.Initial.Contact,fill=Encounter.Outcome))+geom_bar()+coord_flip()
This plot shows that the majority of the cases where the initial contact was a suspected violation/crime in the street led to arrests, while the majority of the cases recorded as “Other” led to no arrest or summons.
This exploratory data analysis of civilian incident reports from the CCRB revealed several important trends, as follows.
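A stacked bar chart makes it hard to judge whether one outcome really exceeds half of a bar. Row proportions of a reason-by-outcome table give a direct check. The sketch below runs on made-up rows; the reason and outcome labels are hypothetical stand-ins for the real categories:

```r
# Toy data: hypothetical values, one row per complaint.
toy <- data.frame(
  Reason.For.Initial.Contact = c("Street stop", "Street stop", "Street stop",
                                 "Other", "Other", "Other"),
  Encounter.Outcome = c("Arrest", "Arrest", "No Arrest or Summons",
                        "No Arrest or Summons", "No Arrest or Summons",
                        "Summons")
)

outcome.by.reason <- with(toy, table(Reason.For.Initial.Contact,
                                     Encounter.Outcome))

# Share of each outcome within a reason (each row sums to 1).
outcome.share <- prop.table(outcome.by.reason, margin = 1)

# Most common outcome for each reason.
top.outcome <- colnames(outcome.share)[apply(outcome.share, 1, which.max)]
names(top.outcome) <- rownames(outcome.share)

# A "majority" claim needs the largest share to exceed one half.
is.majority <- apply(outcome.share, 1, max) > 0.5

print(round(outcome.share, 2))
print(top.outcome)
print(is.majority)
```

Run on the filtered `outcome.year` data frame, the same steps would show, for each of the top five reasons, which outcome is most common and whether it is an actual majority or only a plurality.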