Objectives

The objective of this assignment is to conduct an exploratory data analysis of a data set that you are not familiar with. In this week's lecture we discussed a number of visualization approaches to exploring a data set; this assignment will apply those tools and techniques. An important distinction between class examples and applied data science work is the iterative and repetitive nature of exploring a data set. It takes time to understand what is in the data and what is interesting in the data.

For this week we will be exploring data from the NYC Data Transparency Initiative. They maintain a database of complaints that fall within the jurisdiction of the Civilian Complaint Review Board (CCRB), an independent municipal agency. Your objective is to identify interesting patterns and trends within the data that may be indicative of large-scale trends.

This link will allow you to download the data set in .xlsx format. The data file has two tabs: one with metadata, and the “Complaints_Allegations” tab with the actual data.

Deliverable and Grades

For this assignment you should submit an RPubs link to a knitr-rendered HTML document that shows your exploratory data analysis. Organize your analysis using meaningful section headings:

library(plyr, lib.loc="C:/software/Rpackages")
library(dplyr, lib.loc="C:/software/Rpackages")
library(ggplot2, lib.loc="C:/software/Rpackages")
library(ggthemes, lib.loc="C:/software/Rpackages")
library(wordcloud, lib.loc="C:/software/Rpackages")
library(dygraphs, lib.loc="C:/software/Rpackages")
library(readxl, lib.loc="C:/software/Rpackages")
library(forcats, lib.loc="C:/software/Rpackages")
library(treemap, lib.loc="C:/software/Rpackages")

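# Read the data tab explicitly; the workbook also contains a metadata tab
vData <- read_excel("C:\\Users\\varun.bhagat\\Downloads\\New folder\\ccrb_datatransparencyinitiative.xlsx",
                    sheet = "Complaints_Allegations")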

Number of complaints received

Shows the number of complaints received each year

# Refer to columns by bare name inside aes() rather than vData$...
ggplot(vData, aes(x = `Received Year`)) +
  geom_histogram(binwidth = 1, color = "black") +
  geom_smooth(stat = "bin", binwidth = 1, color = "yellow") +
  theme_economist() +
  labs(x = "Year", y = "# Complaints", title = "# Complaints per year")

# Keep 2005 onward; earlier years do not appear to be complete
vData <- subset(vData, `Received Year` > 2004)
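Because the data come from the “Complaints_Allegations” tab, a single complaint may occupy several rows (one per allegation), in which case row counts overstate complaint volume. The sketch below uses a made-up toy table, not the CCRB file, to contrast the two counts. The identifier column name `UniqueComplaintId` is an assumption for illustration; check `names(vData)` before adapting it.

```r
library(dplyr)

# Toy data (invented): complaint 1 has two allegations, complaint 3 has three.
# `UniqueComplaintId` is an assumed column name, not confirmed by the source.
toy <- data.frame(
  UniqueComplaintId = c(1, 1, 2, 3, 3, 3),
  `Received Year`   = c(2005, 2005, 2005, 2006, 2006, 2006),
  check.names = FALSE
)

per_year <- toy %>%
  group_by(`Received Year`) %>%
  summarise(
    allegation_rows = n(),                           # what a histogram of rows counts
    complaints      = n_distinct(UniqueComplaintId)  # distinct complaints in that year
  )

print(per_year)
```

If the two columns differ noticeably on the real data, the plots in this report are better read as allegation counts.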

Complaints per Year by Borough (Since 2005)

Gives the breakdown of # Complaints per year by Borough

ggplot(vData, aes(x = `Received Year`, color = `Borough of Occurrence`, fill = `Borough of Occurrence`)) +
  geom_bar() +
  geom_smooth(stat = "bin", binwidth = 1) +
  theme_economist() +
  labs(x = "Year", y = "# Complaints", title = "# Complaints per year (Since 2005)") +
  scale_fill_discrete(name = "Borough") +
  guides(color = "none")  # hide the duplicate colour legend

Encounter Outcome by Borough

Gives the breakdown of # Complaints per Borough by Outcome

ggplot(vData, aes(x = `Borough of Occurrence`, fill = `Encounter Outcome`)) +
  geom_bar() +
  theme_economist() +
  labs(x = "Borough", y = "# Complaints", title = "Encounter Outcome by Borough (Since 2005)") +
  scale_fill_discrete(name = "Outcome")

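# Count rows per borough and outcome, then express each count as a percentage
vSumm <- vData %>%
  group_by(Borough = `Borough of Occurrence`, Outcome = `Encounter Outcome`) %>%
  dplyr::summarise(Count = n()) %>%
  # the result is still grouped by Borough, so sum(Count) is the borough total
  dplyr::mutate(Per = Count * 100 / sum(Count))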
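The percentage step relies on a dplyr detail: after `summarise()` the result is still grouped by the first grouping variable, so the sum inside `mutate()` is each borough's total rather than the grand total. A minimal sketch on invented rows (the borough and outcome values are toy data, and the `.groups` argument assumes dplyr 1.0 or later):

```r
library(dplyr)

# Toy data (invented): borough A has 3 rows, borough B has 2.
toy <- data.frame(
  Borough = c("A", "A", "A", "B", "B"),
  Outcome = c("Arrest", "Arrest", "No Arrest", "Arrest", "No Arrest")
)

pct <- toy %>%
  group_by(Borough, Outcome) %>%
  summarise(Count = n(), .groups = "drop_last") %>%  # keep grouping by Borough only
  mutate(Per = Count * 100 / sum(Count)) %>%         # share within each borough
  ungroup()

print(pct)
```

Each borough's percentages add up to 100, which is what a stacked percentage bar needs.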

Encounter Outcome by Borough (Since 2005)

Gives the breakdown (%) of Outcomes by Borough

ggplot(vSumm, aes(x = Borough, fill = Outcome, y = Per)) +
  geom_col() +
  theme_economist() +
  labs(x = "Borough", y = "% of Complaints", title = "Encounter Outcome (%) by Borough (Since 2005)") +
  scale_fill_discrete(name = "Outcome")

Diverging Bars - Arrests across Boroughs (Since 2005)

Shows how far each Borough's # Arrests diverges from the average across Boroughs, in standard deviations

vArrests <- vSumm %>%
  dplyr::ungroup() %>%  # drop the Borough grouping so mean() and sd() span all boroughs
  dplyr::filter(Outcome == "Arrest") %>%
  dplyr::mutate(val = round((Count - mean(Count)) / sd(Count), 2),  # standardised count
                typ = ifelse(val < 0, "below", "above"),
                Borough = fct_reorder(Borough, val, .desc = TRUE))


ggplot(vArrests, aes(x = Borough, y = val)) +
  geom_bar(stat = "identity", aes(fill = typ)) +
  scale_fill_manual(name = "Arrests",
                    labels = c("Above Average", "Below Average"),
                    values = c("above" = "#00ba38", "below" = "#f8766d")) +
  labs(subtitle = "Normalised # Arrests (Since 2005)",
       title = "Diverging Bars") +
  coord_flip()
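The normalised value plotted here is a z-score: each borough's arrest count minus the mean across boroughs, divided by the standard deviation. A small base R sketch with invented counts (not CCRB figures) shows the arithmetic and the above/below labelling:

```r
# Toy arrest counts for three made-up boroughs
counts <- c(A = 10, B = 20, C = 30)

# z-score: distance from the mean in units of the sample standard deviation
z <- round((counts - mean(counts)) / sd(counts), 2)

# Same rule as the chart: negative values are "below", everything else "above"
typ <- ifelse(z < 0, "below", "above")

print(z)
print(typ)
```

Note that a borough exactly at the mean is labelled "above" under this rule, and that `sd()` uses the n - 1 denominator, so with only a handful of boroughs the scores are coarse.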

Allegation Description distribution by FADO Type (Since 2005)

Gives the distribution of different Allegation Descriptions by FADO Type

vTree <- vData %>%
  group_by(Type = `Allegation FADO Type`, Desc = `Allegation Description`) %>%
  dplyr::summarise(Count = n()) %>%
  dplyr::ungroup()  # return a plain, ungrouped table for treemap()

            
treemap(vTree,
        index=c("Type","Desc"),  
        vSize = "Count",  
        type="index", 
        title="Allegation Description distribution by FADO Type (Since 2005)", 
        fontsize.title = 8
)

Distribution of Reason for Initial Contact (Since 2005)

Gives the distribution of Reason for Initial Contact

vTree <- vData %>%
  group_by(Reason = `Reason For Initial Contact`) %>%
  dplyr::summarise(Count = n()) %>%
  dplyr::arrange(desc(Count))


treemap(vTree, 
        index=c("Reason"),  
        vSize = "Count",  
        type="index", 
        title="Distribution of Reason for Initial Contact (Since 2005)", 
        fontsize.title = 8
)

Number of complaints received each year by Mode (Since 2005)

Shows the trend of each complaint filing mode by year

# Assumption: the filing mode is stored in a column named `Complaint Filed Mode`;
# confirm the exact name with names(vData) before running
ggplot(vData, aes(x = `Received Year`, fill = `Complaint Filed Mode`)) +
  geom_bar() +
  labs(title = "Number of Complaints by Mode (Since 2005)", x = "Year", y = "Number of Complaints") +
  scale_fill_discrete(name = "Mode") +
  theme_economist()

Likelihood of Arrest by Allegation Desc (Since 2005)

Gives the likelihood (%) of Arrest by Allegation Description

vArrests <- vData %>%
  dplyr::mutate(Arrest = ifelse(`Encounter Outcome` == "Arrest", 1, 0)) %>%  # 1 if the encounter ended in arrest
  group_by(Allegation = `Allegation Description`) %>%
  dplyr::summarise(Count = n(), `Arrest Count` = sum(Arrest)) %>%
  dplyr::mutate(Arrest_Per = `Arrest Count` * 100 / Count) %>%
  dplyr::arrange(desc(Arrest_Per)) %>%
  head(25) %>%  # keep the 25 allegations with the highest arrest percentage
  dplyr::mutate(Allegation = fct_reorder(Allegation, Arrest_Per, .desc = TRUE))
 
ggplot(vArrests, aes(x = Allegation, y = Arrest_Per)) +
  geom_col() +
  theme_economist() +
  coord_flip() +
  labs(x = "Allegation - Top 25", y = "Arrest Percent", title = "Likelihood of Arrest by Allegation Desc - Top 25 (Since 2005)")
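A percentage computed from very few rows can dominate a top-25 ranking. One possible refinement, which is not part of the analysis above, is to require a minimum number of rows before ranking. In the sketch below the threshold `min_n` and the allegation names are hypothetical, and the data are invented:

```r
library(dplyr)

# Toy data (invented): "X" has 2 rows, both arrests; "Y" has 50 rows, 20 arrests.
toy <- data.frame(
  Allegation = c(rep("X", 2), rep("Y", 50)),
  Outcome    = c("Arrest", "Arrest", rep("Arrest", 20), rep("No Arrest", 30))
)

min_n <- 10  # hypothetical minimum row count for an allegation to be ranked

ranked <- toy %>%
  group_by(Allegation) %>%
  summarise(Count = n(),
            Arrest_Per = 100 * mean(Outcome == "Arrest")) %>%  # share of rows with an arrest
  filter(Count >= min_n) %>%
  arrange(desc(Arrest_Per))

print(ranked)
```

Here "X" would top an unfiltered ranking at 100% on the strength of two rows; with the threshold applied only "Y" remains.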

Incident Location of Complaints by Borough (Since 2005)

Shows the number of complaints at each incident location within each borough

# geom_bar() counts rows per category; geom_histogram(stat = "count") only triggers a warning
ggplot(vData, aes(x = `Borough of Occurrence`, fill = `Incident Location`)) +
  geom_bar() +
  labs(title = "Incident Location of Complaints by Borough (Since 2005)", x = "Borough", y = "Number of Complaints") +
  scale_fill_discrete(name = "Incident Location") +
  theme_economist()

Summary

Here’s what I found:
The number of complaints received per year has been decreasing since 2007.
Data prior to 2005 doesn’t appear to be complete and was therefore removed from further analysis.
Most complaints are filed in Brooklyn and the fewest in Staten Island.
In all boroughs, the number of cases that lead to arrests is almost the same as the number of cases that don’t lead to arrests.
Brooklyn has the highest divergence (from average) when it comes to number of arrests.
Over 50% of the complaints are related to “Abuse of Authority”.
The biggest reason for initial contact is “PD suspected C/V of violation/crime”.
Phone is the most common mode of lodging a complaint.
“Radio at Club” is the allegation most often associated with an arrest.
“Threat of summons” is the allegation least often associated with an arrest.
Most of the incidents happen on the “Streets/Highway”.

ANLY 512 - Problem Set 5

Exploratory Data Analysis

Varun Bhagat

2017-09-12