
Objectives

The objective of this assignment is to conduct an exploratory data analysis of a data set that you are not familiar with. In this week’s lecture we discussed a number of visualization approaches to exploring a data set; this assignment will apply those tools and techniques. An important distinction between class examples and applied data science work is the iterative and repetitive nature of exploring a data set. It takes time to understand what is in the data and what is interesting in the data.

This week we will be exploring data from the NYC Data Transparency Initiative. They maintain a database of complaints that fall under the Civilian Complaint Review Board (CCRB), an independent municipal agency. Your objective is to identify interesting patterns and trends within the data that may be indicative of large-scale trends.

This link will allow you to download the data set in .xlsx format. The data file has two tabs: one with metadata, and the “Complaints_Allegations” tab with the actual data.

Deliverable and Grades

For this assignment you should submit a link to a knitr-rendered HTML document that shows your exploratory data analysis. Organize your analysis using section headings:

This is a top section

This is a subsection

Your final document should include at least 10 visualizations. Each should include a brief statement of why you made the graphic.

A final section should summarize what you learned from your EDA. Your grade will be based on the quality of your graphics and the sophistication of your findings.

First let’s look at the distribution of incidents over the years. From the plot below it looks like the overall number of reported incidents declined from 2006 to 2016. There may be an actual decline in incidents, or the number of people reporting incidents may have decreased.

library(readxl)
library(ggplot2)
ccrb_datatransparencyinitiative <- read_excel("~/Downloads/ccrb_datatransparencyinitiative.xlsx", 
    sheet = "Complaints_Allegations")
## Warning in strptime(x, format, tz = tz): unknown timezone 'default/America/
## New_York'
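# Shorter alias for the imported sheet
ccrb_data <- ccrb_datatransparencyinitiative
# Keep one row per complaint/year pair so each complaint is counted once
inci.year <- unique(ccrb_data[c("UniqueComplaintId", "Incident Year")])
# data.frame() converts "Incident Year" to the syntactic name Incident.Year
inci.year <- data.frame(inci.year)
ggplot(inci.year, aes(Incident.Year)) + geom_bar()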
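
The `unique()` step above matters if the sheet holds one row per allegation, which the tab name “Complaints_Allegations” suggests but which is an assumption here. The following minimal sketch uses a small made-up data frame (the values are hypothetical, not taken from the CCRB file) to show how counting raw rows differs from counting distinct complaints:

```r
# Toy data (hypothetical values): one row per allegation, so a complaint
# with several allegations appears on several rows.
toy <- data.frame(
  UniqueComplaintId = c(1, 1, 1, 2, 3, 3),
  Incident.Year     = c(2006, 2006, 2006, 2007, 2007, 2007)
)

# Counting raw rows counts allegations, not complaints.
rows.per.year <- table(toy$Incident.Year)

# Keeping one row per (complaint, year) pair counts each complaint once.
toy.unique <- unique(toy[c("UniqueComplaintId", "Incident.Year")])
complaints.per.year <- table(toy.unique$Incident.Year)

print(rows.per.year)
print(complaints.per.year)
```

In the toy data both years have three rows, but 2006 contains a single complaint while 2007 contains two.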

From the stem-and-leaf plot below we can see that incident counts were highest from 2006 through 2009 (2006 and 2009 are nearly tied) and decreased steadily from 2010 to 2016.

stem(inci.year$Incident.Year)
## 
##   The decimal point is at the |
## 
##   1999 | 00
##   2000 | 0
##   2001 | 
##   2002 | 000
##   2003 | 0000000
##   2004 | 00000000000000000000000000000000000000000000000000000000000000000000+122
##   2005 | 00000000000000000000000000000000000000000000000000000000000000000000+3344
##   2006 | 00000000000000000000000000000000000000000000000000000000000000000000+7618
##   2007 | 00000000000000000000000000000000000000000000000000000000000000000000+7464
##   2008 | 00000000000000000000000000000000000000000000000000000000000000000000+7263
##   2009 | 00000000000000000000000000000000000000000000000000000000000000000000+7549
##   2010 | 00000000000000000000000000000000000000000000000000000000000000000000+6381
##   2011 | 00000000000000000000000000000000000000000000000000000000000000000000+5932
##   2012 | 00000000000000000000000000000000000000000000000000000000000000000000+5675
##   2013 | 00000000000000000000000000000000000000000000000000000000000000000000+5330
##   2014 | 00000000000000000000000000000000000000000000000000000000000000000000+4670
##   2015 | 00000000000000000000000000000000000000000000000000000000000000000000+4322
##   2016 | 00000000000000000000000000000000000000000000000000000000000000000000+2769
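
Because the stem-and-leaf display truncates long rows and only reports the overflow, exact yearly counts can be easier to compare in a table. The sketch below shows one way to tabulate counts and pick out the busiest year; the `years` vector is a hypothetical stand-in for `inci.year$Incident.Year`, not real data:

```r
# Hypothetical years standing in for inci.year$Incident.Year.
years <- c(2006, 2006, 2006, 2007, 2007, 2009, 2009, 2016)

# Tabulate complaints per year; names are the years, values are counts.
counts <- table(years)

# Year(s) with the largest count; ties are all returned.
peak.years <- names(counts)[counts == max(counts)]

print(counts)
print(peak.years)
```

Returning all tied years, rather than only the first maximum, avoids hiding a near-tie between adjacent years.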

Now let’s look at the distribution of incidents across the boroughs of NYC. From the plots below, incidents appear to decrease at a similar rate over the years in all boroughs. Staten Island and Queens have the fewest incidents, while Brooklyn, the Bronx and Manhattan have the most.

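# One row per complaint, with its year and borough
area.year <- unique(ccrb_data[c("UniqueComplaintId", "Incident Year", "Borough of Occurrence")])
area.year <- data.frame(area.year)
# Stacked bars: complaints per year, split by borough
ggplot(area.year, aes(Incident.Year, fill = Borough.of.Occurrence)) + geom_bar()

# Same counts drawn as one line per borough
ggplot(area.year, aes(Incident.Year, color = Borough.of.Occurrence)) + geom_freqpoly(binwidth = 1)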

Now let’s look at the type of location where incidents occurred. From the plot below, incidents reported at Street/Highway locations were higher in 2006-2009 and then steadily decreased from 2010 to 2016, whereas the other locations did not show such a large change.

loc.area.year <- unique(ccrb_data[c("UniqueComplaintId", "Incident Year", "Incident Location")])
# Count complaints per year and location; table() names the columns Var1 (year) and Var2 (location)
loc.area.year <- as.data.frame(table(loc.area.year$`Incident Year`, loc.area.year$`Incident Location`))
ggplot(loc.area.year, aes(Var1, Freq, color = Var2)) + geom_point()

Now let’s look at the mode of reporting incidents and whether one method is preferred over the others. From the plots below, Phone is the most used reporting mode, followed by the Call Processing System and then the online website. In more recent years, online website reporting has increased compared to previous years, whereas reporting by phone and through the call processing system has decreased.

mode.year <- data.frame(unique(ccrb_data[c("UniqueComplaintId", "Incident Year", "Complaint Filed Mode")]))
# Overall number of complaints per filing mode
ggplot(mode.year, aes(Complaint.Filed.Mode)) + geom_bar()

# Count complaints per year and filing mode, then plot the counts over time
mode.year <- as.data.frame(with(mode.year, table(Incident.Year, Complaint.Filed.Mode)))
ggplot(mode.year, aes(Incident.Year, Freq, color = Complaint.Filed.Mode)) + geom_point()
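
Since the total number of complaints falls over time, a shift between filing modes can be easier to judge from each mode's share within a year than from raw counts. The following is a minimal sketch of that idea using base R's `xtabs()` and `prop.table()`; the data frame `toy.mode` and its counts and mode labels are hypothetical, chosen only to mirror the long format that `mode.year` has after the `table()` step:

```r
# Hypothetical counts in the same long format as mode.year after table():
# one row per (year, filing mode) with a Freq column.
toy.mode <- data.frame(
  Incident.Year        = c("2006", "2006", "2016", "2016"),
  Complaint.Filed.Mode = c("Phone", "Online website", "Phone", "Online website"),
  Freq                 = c(90, 10, 60, 40)
)

# Rebuild a year-by-mode table, then divide each row by its total.
wide  <- xtabs(Freq ~ Incident.Year + Complaint.Filed.Mode, data = toy.mode)
share <- prop.table(wide, margin = 1)

print(round(share, 2))
```

With `margin = 1` each row (year) sums to 1, so the values can be read as the fraction of that year's complaints filed through each mode.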

Now let’s look at the reasons for initial contact. “P/D suspected C/V of Violation/Crime - Street” is the most common reason for initial contact among the reported incidents.

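reason.year <- data.frame(unique(ccrb_data[c("UniqueComplaintId", "Incident Year", "Reason For Initial Contact")]))
# Count complaints per reason, most frequent first (named to avoid masking base::order)
reason.counts <- data.frame(sort(table(reason.year$Reason.For.Initial.Contact), decreasing = TRUE))
# Plot the ten most common reasons
ggplot(reason.counts[1:10, ], aes(Var1, Freq)) + geom_point() + coord_flip()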

Now let’s look at encounter outcomes for incidents. A majority of the complaints result in “No Arrests or Summons”. Some encounters ended in arrests, which we need to explore further. The second plot breaks down outcomes for the five most common reasons for initial contact and shows that the majority of the cases suspected as a violation/crime in the street led to arrests.

outcome.year <- data.frame(unique(ccrb_data[c("UniqueComplaintId", "Incident Year", "Reason For Initial Contact", "Encounter Outcome")]))
# Count complaints per outcome, most frequent first (named to avoid masking base::order)
outcome.counts <- data.frame(sort(table(outcome.year$Encounter.Outcome), decreasing = TRUE))
ggplot(outcome.counts[1:4, ], aes(Var1, Freq)) + geom_point()

# Keep only the five most common reasons and show their outcome mix
reasons <- data.frame(sort(table(outcome.year$Reason.For.Initial.Contact), decreasing = TRUE))
outcome.year <- as.data.frame(outcome.year[outcome.year$Reason.For.Initial.Contact %in% reasons$Var1[1:5], ])
ggplot(outcome.year, aes(Reason.For.Initial.Contact, fill = Encounter.Outcome)) + geom_bar() + coord_flip()

Summary

This exploratory data analysis of civilian incident reports from the CCRB revealed several important trends:

1. The overall number of reported incidents declined from 2006 to 2016.
2. Brooklyn, the Bronx and Manhattan have the most incidents, whereas Staten Island and Queens have the fewest.
3. Incidents reported at Street/Highway locations were higher in 2006-2009 and then steadily decreased from 2010 to 2016.
4. Phone is the most used reporting mode, followed by the call processing system.
5. “P/D suspected C/V of Violation/Crime - Street” is the most common reason for initial contact.
6. The majority of the cases suspected as a violation/crime in the street led to arrests.