Objectives

The objective of this assignment is to conduct an exploratory data analysis of a data set that you are not familiar with. In this week's lecture we discussed a number of visualization approaches to exploring a data set; this assignment will apply those tools and techniques. An important distinction between class examples and applied data science work is the iterative and repetitive nature of exploring a data set. It takes time to understand what is in the data and what is interesting in the data.

For this week we will be exploring data from the NYC Data Transparency Initiative. They maintain a database of complaints that fall under the Civilian Complaint Review Board (CCRB), an independent municipal agency. Your objective is to identify interesting patterns and trends within the data that may be indicative of large-scale trends.

This link will allow you to download the data set in .xlsx format. The data file has two tabs: one with metadata, and the “Complaints_Allegations” tab with the actual data.
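The analysis in this report reads a CSV export, but the data tab of the workbook can also be read directly. The following is a minimal sketch, assuming the readxl package is installed. The helper name `load_ccrb` and the file name in the usage comment are hypothetical; only the sheet name comes from the description above.

```r
# Hypothetical helper: read the data tab of the CCRB workbook.
load_ccrb <- function(path, sheet = "Complaints_Allegations") {
  if (!file.exists(path)) {
    stop("File not found: ", path)
  }
  if (!requireNamespace("readxl", quietly = TRUE)) {
    stop("Package 'readxl' is required to read .xlsx files")
  }
  df <- as.data.frame(readxl::read_excel(path, sheet = sheet))
  # Turn headers into syntactic R names, much as read.csv() does by default.
  names(df) <- make.names(names(df))
  df
}

# Usage (the file name is an assumption):
# data <- load_ccrb("ccrb.xlsx")
```

Converting the headers with `make.names()` is a design choice: it replaces spaces and symbols with dots, so column names can be used unquoted in later code.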

Deliverables and Grades

For this assignment you should submit a link to a knitr-rendered HTML document that shows your exploratory data analysis. Organize your analysis using section headings:

  1. Number of complaints received each year
library(ggplot2)
library(ggthemes)
## Warning: package 'ggthemes' was built under R version 3.4.4
data <- read.csv('/Users/sarasijghosh/Documents/HU Classes/Late Spring2018/Anly512-DataViz/ccrb_datatransparencyinitiative.csv',header = TRUE)
summary(is.na(data))
##  DateStamp       UniqueComplaintId Close.Year      Received.Year  
##  Mode :logical   Mode :logical     Mode :logical   Mode :logical  
##  FALSE:204397    FALSE:204397      FALSE:204397    FALSE:204397   
##                                                                   
##  Borough.of.Occurrence Is.Full.Investigation Complaint.Has.Video.Evidence
##  Mode :logical         Mode :logical         Mode :logical               
##  FALSE:203914          FALSE:204397          FALSE:204397                
##  TRUE :483                                                               
##  Complaint.Filed.Mode Complaint.Filed.Place
##  Mode :logical        Mode :logical        
##  FALSE:204397         FALSE:204397         
##                                            
##  Complaint.Contains.Stop...Frisk.Allegations Incident.Location
##  Mode :logical                               Mode :logical    
##  FALSE:204397                                FALSE:201041     
##                                              TRUE :3356       
##  Incident.Year   Encounter.Outcome Reason.For.Initial.Contact
##  Mode :logical   Mode :logical     Mode :logical             
##  FALSE:204397    FALSE:204397      FALSE:203542              
##                                    TRUE :855                 
##  Allegation.FADO.Type Allegation.Description
##  Mode :logical        Mode :logical         
##  FALSE:204394         FALSE:204394          
##  TRUE :3              TRUE :3
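The `summary(is.na(data))` output above is long because it prints a block for every column. A more compact way to see the same information is to count missing values per column and keep only the non-zero counts. This is a sketch on a small made-up data frame (`toy` is hypothetical and only borrows three column names from the data set):

```r
# Toy stand-in for the real data; the values are invented for illustration.
toy <- data.frame(
  Borough.of.Occurrence = c("Bronx", NA, "Queens", "Brooklyn"),
  Incident.Location     = c("Street/highway", NA, NA, "Apartment/house"),
  Received.Year         = c(2010L, 2011L, 2011L, 2012L),
  stringsAsFactors      = FALSE
)

# Number of missing values in each column.
na_counts <- colSums(is.na(toy))

# Show only the columns that actually have missing values.
print(na_counts[na_counts > 0])
```

Applied to the real data frame, the same two lines would be `na_counts <- colSums(is.na(data))` followed by the filter.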
summary(data)
##       DateStamp      UniqueComplaintId   Close.Year   Received.Year 
##  11/29/2016:204397   Min.   :    1     Min.   :2006   Min.   :1999  
##                      1st Qu.:17356     1st Qu.:2008   1st Qu.:2007  
##                      Median :34794     Median :2010   Median :2009  
##                      Mean   :34778     Mean   :2010   Mean   :2010  
##                      3rd Qu.:52204     3rd Qu.:2013   3rd Qu.:2012  
##                      Max.   :69492     Max.   :2016   Max.   :2016  
##                                                                     
##    Borough.of.Occurrence Is.Full.Investigation
##  Bronx        :49442     Mode :logical        
##  Brooklyn     :72215     FALSE:107084         
##  Manhattan    :42104     TRUE :97313          
##  Outside NYC  :  170                          
##  Queens       :30883                          
##  Staten Island: 9100                          
##  NA's         :  483                          
##  Complaint.Has.Video.Evidence             Complaint.Filed.Mode
##  Mode :logical                Call Processing System: 42447   
##  FALSE:195530                 E-mail                :   799   
##  TRUE :8867                   Fax                   :   356   
##                               In-person             :  9586   
##                               Mail                  :  3424   
##                               On-line website       : 14197   
##                               Phone                 :133588   
##        Complaint.Filed.Place Complaint.Contains.Stop...Frisk.Allegations
##  CCRB             :130877    Mode :logical                              
##  IAB              : 69214    FALSE:119856                               
##  Precinct         :  3548    TRUE :84541                                
##  Other City agency:   295                                               
##  Mayor's Office   :   157                                               
##  Other            :   110                                               
##  (Other)          :   196                                               
##             Incident.Location  Incident.Year             Encounter.Outcome
##  Street/highway      :123274   Min.   :1999   Arrest              :89139  
##  Apartment/house     : 34720   1st Qu.:2007   No Arrest or Summons:82964  
##  Residential building: 12421   Median :2009   Other/NA            : 1050  
##  Police building     :  8968   Mean   :2010   Summons             :31244  
##  Subway station/train:  6077   3rd Qu.:2012                               
##  (Other)             : 15581   Max.   :2016                               
##  NA's                :  3356                                              
##                                 Reason.For.Initial.Contact
##  PD suspected C/V of violation/crime - street:60107       
##  Other                                       :39030       
##  PD suspected C/V of violation/crime - bldg  :16067       
##  PD suspected C/V of violation/crime - auto  :12953       
##  Moving violation                            : 8843       
##  (Other)                                     :66542       
##  NA's                                        :  855       
##          Allegation.FADO.Type
##  Abuse of Authority:102173   
##  Discourtesy       : 34452   
##  Force             : 61761   
##  Offensive Language:  6008   
##  NA's              :     3   
##                              
##                              
##                            Allegation.Description
##  Physical force                       :44116     
##  Word                                 :31704     
##  Stop                                 :12944     
##  Search (of person)                   :12250     
##  Refusal to provide name/shield number:10359     
##  (Other)                              :93021     
##  NA's                                 :    3
# Refer to columns by bare name inside aes(); geom_bar() counts rows per
# x value, so no stat = "count" override (and no warning) is needed.
ggplot(data, aes(x = Received.Year, fill = Allegation.FADO.Type)) +
  geom_bar() +
  labs(title = "Number of Complaints Received Each Year", x = "Received Year", y = "Number of Complaints") +
  theme(legend.position = "bottom") +
  scale_fill_discrete(name = "Allegation Type")
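The summary above shows 204,397 rows but a maximum `UniqueComplaintId` of 69,492, which suggests the file holds one row per allegation rather than one row per complaint. Under that assumption, counting rows per year counts allegations. The sketch below shows, on a small invented data frame (`toy` is hypothetical), how the two counts differ and how to count each complaint once:

```r
# Toy stand-in: complaint 1 has three allegations, complaint 3 has two.
toy <- data.frame(
  UniqueComplaintId = c(1L, 1L, 1L, 2L, 3L, 3L),
  Received.Year     = c(2010L, 2010L, 2010L, 2010L, 2011L, 2011L)
)

# Counting rows gives allegations per year.
allegations_per_year <- table(toy$Received.Year)

# Keep one row per complaint before counting to get complaints per year.
complaints <- unique(toy[, c("UniqueComplaintId", "Received.Year")])
complaints_per_year <- table(complaints$Received.Year)

print(allegations_per_year)
print(complaints_per_year)
```

Whether allegations or complaints are the right unit depends on the question being asked; the point is only that the two should not be confused in the axis label.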

  1. Frequency of Incident Occurrence by Borough and Type
# Borough is categorical, so geom_bar() is the appropriate geom.
ggplot(data, aes(x = Borough.of.Occurrence, fill = Allegation.FADO.Type)) +
  geom_bar() +
  labs(title = "Frequency of Incident Occurrence by Borough and Type", x = "Borough of Occurrence", y = "Frequency of Occurrence") +
  theme(legend.position = "bottom") +
  scale_fill_discrete(name = "Allegation Type")

  1. Investigation by Evidence
ggplot(data, aes(x = Is.Full.Investigation, fill = Complaint.Has.Video.Evidence)) +
  geom_bar() +
  labs(title = "Investigation by Evidence", x = "Is Full Investigation", y = "Number") +
  theme(legend.position = "bottom") +
  scale_fill_discrete(name = "Has Video Evidence")
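Because video evidence is rare in this data (8,867 of 204,397 rows in the summary), stacked counts make it hard to compare the two investigation groups. Row-wise shares are one way to make that comparison. This is a sketch on invented values (`toy` is hypothetical and only borrows the two column names):

```r
# Toy stand-in with four fully investigated rows and four that are not.
toy <- data.frame(
  Is.Full.Investigation        = c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE),
  Complaint.Has.Video.Evidence = c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE)
)

# Cross-tabulate investigation status against video evidence.
counts <- table(full  = toy$Is.Full.Investigation,
                video = toy$Complaint.Has.Video.Evidence)

# margin = 1 normalizes each row, giving the share with and without
# video evidence inside each investigation group.
shares <- prop.table(counts, margin = 1)
print(shares)
```

In ggplot2 the equivalent visual is `geom_bar(position = "fill")`, which rescales every bar to the same height.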

  1. Number of Incidents That Occurred Each Year by Evidence
# geom_bar() counts rows per year; no stat = "count" override is needed.
ggplot(data, aes(x = Incident.Year, fill = Complaint.Has.Video.Evidence)) +
  geom_bar() +
  labs(title = "Number of Incidents That Occurred Each Year by Evidence", x = "Incident Year", y = "Number") +
  theme(legend.position = "bottom") +
  scale_fill_discrete(name = "Has Video Evidence")

  1. Number of Cases Closed Each Year by Investigation
# geom_bar() counts rows per year; no stat = "count" override is needed.
ggplot(data, aes(x = Close.Year, fill = Is.Full.Investigation)) +
  geom_bar() +
  labs(title = "Number of Cases Closed Each Year by Investigation", x = "Close Year", y = "Number") +
  theme(legend.position = "bottom") +
  scale_fill_discrete(name = "Is Full Investigation")

  1. Number of Incidents That Occurred Each Year by Outcome
# geom_bar() counts rows per year; no stat = "count" override is needed.
ggplot(data, aes(x = Incident.Year, fill = Encounter.Outcome)) +
  geom_bar() +
  labs(title = "Number of Incidents That Occurred Each Year by Outcome", x = "Incident Year", y = "Number") +
  theme(legend.position = "bottom") +
  scale_fill_discrete(name = "Encounter Outcome")

  1. Encounter Outcome by Borough
ggplot(data, aes(x = Encounter.Outcome, fill = Borough.of.Occurrence)) +
  geom_bar() +
  labs(title = "Encounter Outcome by Borough", x = "Encounter Outcome", y = "Number") +
  theme(legend.position = "bottom") +
  scale_fill_discrete(name = "Borough of Occurrence")

  1. Relationship between Incident Year and Received Year
# Scatter plot of the two year columns with a fitted linear trend line.
ggplot(data, aes(x = Incident.Year, y = Received.Year)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(title = "Relationship between Incident Year and Received Year", x = "Incident Year", y = "Received Year")
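Both axes in this plot are whole years, so many rows land on the same point and the scatter hides how often each combination occurs. One alternative view is to tabulate the lag between the two years directly. This is a sketch on invented values (`toy` is hypothetical and only borrows the two column names):

```r
# Toy stand-in: most complaints are received in the year of the incident.
toy <- data.frame(
  Incident.Year = c(2009L, 2010L, 2010L, 2011L, 2011L),
  Received.Year = c(2010L, 2010L, 2010L, 2011L, 2012L)
)

# Years elapsed between the incident and receipt of the complaint.
lag_years <- toy$Received.Year - toy$Incident.Year

# Frequency of each lag value.
lag_table <- table(lag_years)
print(lag_table)
```

The same subtraction applied to `Close.Year` and `Received.Year` would give a rough, year-level measure of how long cases stay open.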

  1. Relationship between Close Year and Received Year
# Scatter plot of the two year columns with a fitted linear trend line.
ggplot(data, aes(x = Received.Year, y = Close.Year)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(title = "Relationship between Close Year and Received Year", x = "Received Year", y = "Close Year")

  1. Number of Complaints by Allegation Type
ggplot(data, aes(x = Allegation.FADO.Type, fill = Allegation.FADO.Type)) +
  geom_bar() +
  labs(title = "Number of Complaints by Allegation Type", x = "Type", y = "Number") +
  theme(legend.position = "bottom") +
  scale_fill_discrete(name = "Type")

## This exercise helps us understand the role of data visualization and how raw data can be used to understand different scenarios in a case study.