The objective of this assignment is to conduct an exploratory data analysis of a data set that you are not familiar with. In this week's lecture we discussed a number of visualization approaches to exploring a data set; this assignment will apply those tools and techniques. An important distinction between class examples and applied data science work is the iterative and repetitive nature of exploring a data set. It takes time to understand what is in the data and what is interesting in it.
For this week we will be exploring data from the NYC Data Transparency Initiative. They maintain a database of complaints that fall within the jurisdiction of the Civilian Complaint Review Board (CCRB), an independent municipal agency. Your objective is to identify interesting patterns and trends within the data that may be indicative of large-scale trends.
This link will allow you to download the data set in .xlsx format. The data file has two tabs: one with metadata, and the “Complaints_Allegations” tab with the actual data.
For this assignment you should submit an RPubs link to a knitr-rendered HTML document that shows your exploratory data analysis. Organize your analysis using meaningful section headings:
library(plyr, lib.loc="C:/software/Rpackages")
library(dplyr, lib.loc="C:/software/Rpackages")
library(ggplot2, lib.loc="C:/software/Rpackages")
library(ggthemes, lib.loc="C:/software/Rpackages")
library(wordcloud, lib.loc="C:/software/Rpackages")
library(dygraphs, lib.loc="C:/software/Rpackages")
library(readxl, lib.loc="C:/software/Rpackages")
library(forcats, lib.loc="C:/software/Rpackages")
library(treemap, lib.loc="C:/software/Rpackages")
vData=read_excel("C:\\Users\\varun.bhagat\\Downloads\\New folder\\ccrb_datatransparencyinitiative.xlsx")
# Complaints per year across the full data set
ggplot(vData, aes(x = `Received Year`)) +
  geom_histogram(binwidth = 1, color = "black") +
  theme_economist() +
  geom_smooth(stat = "bin", binwidth = 1, color = "yellow") +
  labs(x = "Year", y = "# Complaints", title = "# Complaints per year")
vData=subset(vData,vData$`Received Year`>2004)
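One way the 2005 cutoff might be checked is to tabulate rows per year before filtering. The sketch below does this on a small made-up table; `toy`, `year_counts`, and `toy_recent` are hypothetical names, and the values are illustrative rather than taken from the CCRB file.

```r
library(dplyr)

# Toy stand-in for the complaints table; values are illustrative only.
toy <- data.frame(
  `Received Year` = c(2003, 2004, 2005, 2005, 2005, 2006, 2006, 2006),
  check.names = FALSE
)

# Rows per year: unusually sparse early years hint at incomplete coverage.
year_counts <- toy %>%
  count(`Received Year`, name = "n_rows")
print(year_counts)

# Keep 2005 onward, mirroring the subset() call above.
toy_recent <- filter(toy, `Received Year` > 2004)
```

On the real data, the same `count()` call applied to `vData` before the subset would show how thin the early years are.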
# Complaints per year since 2005, stacked by borough
ggplot(vData, aes(x = `Received Year`, color = `Borough of Occurrence`, fill = `Borough of Occurrence`)) +
  geom_bar() +
  theme_economist() +
  labs(x = "Year", y = "# Complaints", title = "# Complaints per year (Since 2005)") +
  scale_fill_discrete(name = "Borough") +
  geom_smooth(stat = "bin", binwidth = 1) +
  guides(color = "none")
# Encounter outcomes within each borough (raw counts)
ggplot(vData, aes(x = `Borough of Occurrence`, fill = `Encounter Outcome`)) +
  geom_bar() +
  theme_economist() +
  labs(x = "Borough", y = "# Complaints", title = "# Encounter Outcome by Borough (Since 2005)") +
  scale_fill_discrete(name = "Outcome")
# Count outcomes per borough, then convert to a percentage within each borough
vSumm <- vData %>%
  dplyr::group_by(Borough = `Borough of Occurrence`, Outcome = `Encounter Outcome`) %>%
  dplyr::summarise(Count = n()) %>%
  dplyr::mutate(Per = Count * 100 / sum(Count)) %>%
  dplyr::ungroup()
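The percentage step depends on a dplyr detail: after `summarise()` on two grouping variables, the result is still grouped by the first one, so `sum()` in the following `mutate()` runs per borough rather than over the whole table. A minimal sketch on made-up rows (`toy` and `shares` are hypothetical names, and the labels "A" and "B" are placeholders, not real boroughs):

```r
library(dplyr)

# Toy rows; one row per complaint record, values illustrative only.
toy <- data.frame(
  Borough = c("A", "A", "A", "B", "B"),
  Outcome = c("Arrest", "Arrest", "Summons", "Arrest", "Summons"),
  stringsAsFactors = FALSE
)

shares <- toy %>%
  group_by(Borough, Outcome) %>%
  summarise(Count = n()) %>%                  # result is still grouped by Borough
  mutate(Per = Count * 100 / sum(Count)) %>%  # so sum(Count) is the borough total
  ungroup()                                   # drop grouping before further work

print(shares)
```

Each borough's `Per` values add up to 100, which is what a stacked percentage bar chart needs.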
# Same comparison as a percentage of each borough's complaints
ggplot(vSumm, aes(x = Borough, fill = Outcome, y = Per)) +
  geom_bar(stat = "identity") +
  theme_economist() +
  labs(x = "Borough", y = "% of Complaints", title = "% Encounter Outcome by Borough (Since 2005)") +
  scale_fill_discrete(name = "Outcome")
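# Standardise arrest counts across boroughs (z-score); ungroup first so that
# mean() and sd() are computed over all boroughs rather than within each one
vArrests <- subset(vSumm, vSumm$Outcome == "Arrest")
vArrests <- vArrests %>%
  dplyr::ungroup() %>%
  dplyr::mutate(val = round((Count - mean(Count)) / sd(Count), 2)) %>%
  dplyr::mutate(typ = ifelse(val < 0, "below", "above")) %>%
  dplyr::mutate(Borough = fct_reorder(Borough, val, .desc = TRUE))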
ggplot(vArrests, aes(x = Borough, y = val)) +
  geom_bar(stat = "identity", aes(fill = typ)) +
  scale_fill_manual(name = "Arrests",
                    labels = c("Above Average", "Below Average"),
                    values = c("above" = "#00ba38", "below" = "#f8766d")) +
  labs(subtitle = "Normalised # Arrests (Since 2005)",
       title = "Diverging Bars") +
  coord_flip()
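The "normalised" value plotted here is a z-score: each count minus the mean of all counts, divided by their sample standard deviation. A small worked sketch with made-up counts (the vector `counts` and its labels "A" to "E" are hypothetical and do not come from the data):

```r
# Made-up counts for five groups; not taken from the CCRB data.
counts <- c(A = 10, B = 20, C = 15, D = 10, E = 5)

# z-score: distance from the mean in units of the sample standard deviation.
z <- round((counts - mean(counts)) / sd(counts), 2)

# Positive values sit above the average, negative values below it.
typ <- ifelse(z < 0, "below", "above")

print(data.frame(count = counts, z = z, typ = typ))
```

Because the score is built from raw counts, a group with more complaints overall will tend to sit above average; comparing rates instead would need the per-group totals as a denominator.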
# Allegation counts by FADO type and description
vTree <- vData %>%
  dplyr::group_by(Type = `Allegation FADO Type`, Desc = `Allegation Description`) %>%
  dplyr::summarise(Count = n()) %>%
  dplyr::ungroup()
treemap(vTree,
index=c("Type","Desc"),
vSize = "Count",
type="index",
title="Allegation Description distribution by FADO Type (Since 2005)",
fontsize.title = 8
)
# Counts by reason for initial contact, largest first
vTree <- vData %>%
  dplyr::group_by(Reason = `Reason For Initial Contact`) %>%
  dplyr::summarise(Count = n()) %>%
  dplyr::arrange(desc(Count))
treemap(vTree,
index=c("Reason"),
vSize = "Count",
type="index",
title="Distribution of Reason for Initial Contact (Since 2005)",
fontsize.title = 8
)
# Incident location within each borough; geom_bar() counts rows directly,
# which avoids the warning raised by geom_histogram(stat = "count")
ggplot(vData, aes(x = `Borough of Occurrence`, fill = `Incident Location`)) +
  geom_bar() +
  labs(title = "Incident Location of Complaints by Borough (Since 2005)", x = "Borough", y = "Number of Complaints") +
  scale_fill_discrete(name = "Incident Location") +
  theme_economist()
# Share of records per allegation description whose encounter ended in arrest
vArrests <- vData %>%
  dplyr::mutate(Arrest = ifelse(`Encounter Outcome` == "Arrest", 1, 0)) %>%
  dplyr::group_by(Allegation = `Allegation Description`) %>%
  dplyr::summarise(Count = n(), `Arrest Count` = sum(Arrest)) %>%
  dplyr::mutate(Arrest_Per = `Arrest Count` * 100 / Count) %>%
  dplyr::arrange(desc(Arrest_Per))
vArrests = vArrests %>%
mutate(Allegation = fct_reorder(Allegation, Arrest_Per, .desc = TRUE))
vArrests=head(vArrests,25)
ggplot(vArrests, aes(x = Allegation, y = Arrest_Per)) +
  geom_bar(stat = "identity") +
  theme_economist() +
  coord_flip() +
  labs(x = "Allegation - Top 25", y = "Arrest Percent", title = "Likelihood of Arrest by Allegation Desc - Top 25 (Since 2005)")
Here’s what I found:
The number of complaints received per year has decreased every year since 2007.
Data prior to 2005 doesn’t appear to be complete and was therefore removed from further analysis.
Most complaints are filed in Brooklyn and the fewest in Staten Island.
In all boroughs, the number of cases that lead to arrests is almost the same as the number of cases that don’t.
Brooklyn has the highest divergence (from average) in the number of arrests.
Over 50% of the complaints are related to “Abuse of Authority”.
The biggest reason for initial contact is “PD suspected C/V of violation/crime”.
Phone is the most common mode of lodging a complaint.
“Radio at Club” is the allegation most likely to be associated with an arrest.
“Threat of summons” is the allegation least likely to be associated with an arrest.
Most of the incidents happen on the “Streets/Highway”.