The objective of this assignment is to conduct an exploratory data analysis of a data set that you are not familiar with. In this week's lecture we discussed a number of visualization approaches to exploring a data set; this assignment will apply those tools and techniques. An important distinction between class examples and applied data science work is the iterative and repetitive nature of exploring a data set. It takes time to understand what is in the data and what is interesting in it.
For this week we will be exploring data from the NYC Data Transparency Initiative. They maintain a database of complaints that fall within the jurisdiction of the Civilian Complaint Review Board (CCRB), an independent municipal agency. Your objective is to identify interesting patterns and trends within the data that may be indicative of large-scale trends.
This link will allow you to download the data set in .xlsx format. The data file has two tabs: one with metadata, and the “Complaints_Allegations” tab with the actual data.
For this assignment you should submit a link to a knitr-rendered HTML document that shows your exploratory data analysis. Organize your analysis using section headings:
library(ggplot2)
library(ggthemes)
## Warning: package 'ggthemes' was built under R version 3.4.4
data <- read.csv('/Users/sarasijghosh/Documents/HU Classes/Late Spring2018/Anly512-DataViz/ccrb_datatransparencyinitiative.csv',header = TRUE)
summary(is.na(data))
## DateStamp UniqueComplaintId Close.Year Received.Year
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:204397 FALSE:204397 FALSE:204397 FALSE:204397
##
## Borough.of.Occurrence Is.Full.Investigation Complaint.Has.Video.Evidence
## Mode :logical Mode :logical Mode :logical
## FALSE:203914 FALSE:204397 FALSE:204397
## TRUE :483
## Complaint.Filed.Mode Complaint.Filed.Place
## Mode :logical Mode :logical
## FALSE:204397 FALSE:204397
##
## Complaint.Contains.Stop...Frisk.Allegations Incident.Location
## Mode :logical Mode :logical
## FALSE:204397 FALSE:201041
## TRUE :3356
## Incident.Year Encounter.Outcome Reason.For.Initial.Contact
## Mode :logical Mode :logical Mode :logical
## FALSE:204397 FALSE:204397 FALSE:203542
## TRUE :855
## Allegation.FADO.Type Allegation.Description
## Mode :logical Mode :logical
## FALSE:204394 FALSE:204394
## TRUE :3 TRUE :3
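The logical summary above prints a block for every column, including the many columns with no missing values. A more compact view is one count per column, restricted to the columns that actually have gaps. The following is a minimal sketch on a made-up toy data frame (the `toy` object and its values are assumptions for illustration, not the CCRB data); the same two lines can be applied to `data`.

```r
# Toy data frame standing in for the complaints data (values are made up).
toy <- data.frame(
  Borough.of.Occurrence = c("Bronx", NA, "Queens", "Brooklyn"),
  Incident.Location     = c("Street/highway", NA, NA, "Police building"),
  Received.Year         = c(2009, 2010, 2011, 2012),
  stringsAsFactors      = FALSE
)

# Count missing values per column, then keep only the columns that have any,
# sorted so the column with the most gaps comes first.
na_counts <- colSums(is.na(toy))
na_counts <- sort(na_counts[na_counts > 0], decreasing = TRUE)
print(na_counts)
```

Run against the real data, this should list only the columns that show a `TRUE` row in the summary above.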
summary(data)
## DateStamp UniqueComplaintId Close.Year Received.Year
## 11/29/2016:204397 Min. : 1 Min. :2006 Min. :1999
## 1st Qu.:17356 1st Qu.:2008 1st Qu.:2007
## Median :34794 Median :2010 Median :2009
## Mean :34778 Mean :2010 Mean :2010
## 3rd Qu.:52204 3rd Qu.:2013 3rd Qu.:2012
## Max. :69492 Max. :2016 Max. :2016
##
## Borough.of.Occurrence Is.Full.Investigation
## Bronx :49442 Mode :logical
## Brooklyn :72215 FALSE:107084
## Manhattan :42104 TRUE :97313
## Outside NYC : 170
## Queens :30883
## Staten Island: 9100
## NA's : 483
## Complaint.Has.Video.Evidence Complaint.Filed.Mode
## Mode :logical Call Processing System: 42447
## FALSE:195530 E-mail : 799
## TRUE :8867 Fax : 356
## In-person : 9586
## Mail : 3424
## On-line website : 14197
## Phone :133588
## Complaint.Filed.Place Complaint.Contains.Stop...Frisk.Allegations
## CCRB :130877 Mode :logical
## IAB : 69214 FALSE:119856
## Precinct : 3548 TRUE :84541
## Other City agency: 295
## Mayor's Office : 157
## Other : 110
## (Other) : 196
## Incident.Location Incident.Year Encounter.Outcome
## Street/highway :123274 Min. :1999 Arrest :89139
## Apartment/house : 34720 1st Qu.:2007 No Arrest or Summons:82964
## Residential building: 12421 Median :2009 Other/NA : 1050
## Police building : 8968 Mean :2010 Summons :31244
## Subway station/train: 6077 3rd Qu.:2012
## (Other) : 15581 Max. :2016
## NA's : 3356
## Reason.For.Initial.Contact
## PD suspected C/V of violation/crime - street:60107
## Other :39030
## PD suspected C/V of violation/crime - bldg :16067
## PD suspected C/V of violation/crime - auto :12953
## Moving violation : 8843
## (Other) :66542
## NA's : 855
## Allegation.FADO.Type
## Abuse of Authority:102173
## Discourtesy : 34452
## Force : 61761
## Offensive Language: 6008
## NA's : 3
##
##
## Allegation.Description
## Physical force :44116
## Word :31704
## Stop :12944
## Search (of person) :12250
## Refusal to provide name/shield number:10359
## (Other) :93021
## NA's : 3
# Records received per year, stacked by allegation type.
# Note: rows appear to be allegation-level (UniqueComplaintId tops out at 69492
# across 204397 rows), so the bars count allegation records.
ggplot(data, aes(x = Received.Year, fill = Allegation.FADO.Type)) +
  geom_bar() +
  labs(title = "Number of Complaints Received Each Year", x = "Received Year", y = "Number of Complaints") +
  theme(legend.position = "bottom") +
  scale_fill_discrete(name = "Allegation Type")
# Incident counts per borough, stacked by allegation type
ggplot(data, aes(x = Borough.of.Occurrence, fill = Allegation.FADO.Type)) +
  geom_bar() +
  labs(title = "Frequency of Incident Occurrence by Borough and Type", x = "Borough of Occurrence", y = "Frequency of Occurrence") +
  theme(legend.position = "bottom") +
  scale_fill_discrete(name = "Allegation Type")
# Full vs. partial investigations, split by whether video evidence exists
ggplot(data, aes(x = Is.Full.Investigation, fill = Complaint.Has.Video.Evidence)) +
  geom_bar() +
  labs(title = "Investigation by Evidence", x = "Is Full Investigation", y = "Number") +
  theme(legend.position = "bottom") +
  scale_fill_discrete(name = "Has Video Evidence")
# Incidents per year, split by whether video evidence exists
ggplot(data, aes(x = Incident.Year, fill = Complaint.Has.Video.Evidence)) +
  geom_bar() +
  labs(title = "Number of Incidents Each Year by Evidence", x = "Incident Year", y = "Number") +
  theme(legend.position = "bottom") +
  scale_fill_discrete(name = "Has Video Evidence")
# Cases closed per year, split by full vs. partial investigation
ggplot(data, aes(x = Close.Year, fill = Is.Full.Investigation)) +
  geom_bar() +
  labs(title = "Number of Cases Closed Each Year by Investigation", x = "Close Year", y = "Number") +
  theme(legend.position = "bottom") +
  scale_fill_discrete(name = "Is Full Investigation")
# Incidents per year, stacked by encounter outcome
ggplot(data, aes(x = Incident.Year, fill = Encounter.Outcome)) +
  geom_bar() +
  labs(title = "Number of Incidents Each Year by Outcome", x = "Incident Year", y = "Number") +
  theme(legend.position = "bottom") +
  scale_fill_discrete(name = "Encounter Outcome")
# Encounter outcomes, stacked by borough
ggplot(data, aes(x = Encounter.Outcome, fill = Borough.of.Occurrence)) +
  geom_bar() +
  labs(title = "Encounter Outcome by Borough", x = "Encounter Outcome", y = "Number") +
  theme(legend.position = "bottom") +
  scale_fill_discrete(name = "Borough of Occurrence")
# Incident year vs. received year, with a fitted linear trend
ggplot(data, aes(x = Incident.Year, y = Received.Year)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(title = "Relationship between Incident Year and Received Year", x = "Incident Year", y = "Received Year")
# Received year vs. close year, with a fitted linear trend
ggplot(data, aes(x = Received.Year, y = Close.Year)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(title = "Relationship between Close Year and Received Year", x = "Received Year", y = "Close Year")
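Because both axes in the two scatter plots are whole years, many records land on the same point and the plots cannot show how many records each point represents. One possible complement is to tabulate the lag between the two years directly. The sketch below uses a made-up toy data frame (the `toy` object, its values, and the `Years.To.Close` column name are assumptions for illustration); the same subtraction could be applied to `data$Close.Year` and `data$Received.Year`.

```r
# Toy data frame standing in for the complaints data (values are made up).
toy <- data.frame(
  Received.Year = c(2008, 2008, 2009, 2010, 2010),
  Close.Year    = c(2008, 2009, 2009, 2012, 2010)
)

# Lag in whole years between receiving a complaint and closing it.
toy$Years.To.Close <- toy$Close.Year - toy$Received.Year

# Frequency of each lag value; this shows the counts a scatter plot hides.
lag_counts <- table(toy$Years.To.Close)
print(lag_counts)
```

A bar chart of these counts would then show the distribution of closing times at a glance.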
# Overall counts by allegation type
ggplot(data, aes(x = Allegation.FADO.Type, fill = Allegation.FADO.Type)) +
  geom_bar() +
  labs(title = "Number of Complaints by Allegation Type", x = "Type", y = "Number") +
  theme(legend.position = "bottom") +
  scale_fill_discrete(name = "Type")
## This exercise helps us understand the role of data visualization and how raw data can be used to examine different scenarios within a case study.