Objectives

The objective of this assignment is to conduct an exploratory data analysis of a data set that you are not familiar with. In this week’s lecture, we discussed a number of visualization approaches for exploring a data set. This assignment will apply those tools and techniques. An important distinction between class examples and applied data science work is the iterative and repetitive nature of exploring a data set. It takes time to understand what the data is and what is interesting about it (patterns).

For this week, we will be exploring data from the NYC Data Transparency Initiative. They maintain a database of complaints that fall within the jurisdiction of the Civilian Complaint Review Board (CCRB), an independent municipal agency. Your objective is to identify interesting patterns and trends within the data that may be indicative of large-scale trends.

This link will allow you to download the data set in .xlsx format. The data file has two tabs: one with metadata, and the “Complaints_Allegations” tab with the actual data.

library(ggplot2)
library(ggthemes)

library(readxl)
data1 <- read_excel("/Users/RunhaoWang/Desktop/512 O/ccrb_datatransparencyinitiative.xlsx",sheet = "Complaints_Allegations")

summary(data1)
##    DateStamp          UniqueComplaintId   Close Year   Received Year 
##  Min.   :2016-11-29   Min.   :    1     Min.   :2006   Min.   :1999  
##  1st Qu.:2016-11-29   1st Qu.:17356     1st Qu.:2008   1st Qu.:2007  
##  Median :2016-11-29   Median :34794     Median :2010   Median :2009  
##  Mean   :2016-11-29   Mean   :34778     Mean   :2010   Mean   :2010  
##  3rd Qu.:2016-11-29   3rd Qu.:52204     3rd Qu.:2013   3rd Qu.:2012  
##  Max.   :2016-11-29   Max.   :69492     Max.   :2016   Max.   :2016  
##  Borough of Occurrence Is Full Investigation Complaint Has Video Evidence
##  Length:204397         Mode :logical         Mode :logical               
##  Class :character      FALSE:107084          FALSE:195530                
##  Mode  :character      TRUE :97313           TRUE :8867                  
##                                                                          
##                                                                          
##                                                                          
##  Complaint Filed Mode Complaint Filed Place
##  Length:204397        Length:204397        
##  Class :character     Class :character     
##  Mode  :character     Mode  :character     
##                                            
##                                            
##                                            
##  Complaint Contains Stop & Frisk Allegations Incident Location  Incident Year 
##  Mode :logical                               Length:204397      Min.   :1999  
##  FALSE:119856                                Class :character   1st Qu.:2007  
##  TRUE :84541                                 Mode  :character   Median :2009  
##                                                                 Mean   :2010  
##                                                                 3rd Qu.:2012  
##                                                                 Max.   :2016  
##  Encounter Outcome  Reason For Initial Contact Allegation FADO Type
##  Length:204397      Length:204397              Length:204397       
##  Class :character   Class :character           Class :character    
##  Mode  :character   Mode  :character           Mode  :character    
##                                                                    
##                                                                    
##                                                                    
##  Allegation Description
##  Length:204397         
##  Class :character      
##  Mode  :character      
##                        
##                        
## 
str(data1)
## tibble [204,397 × 16] (S3: tbl_df/tbl/data.frame)
##  $ DateStamp                                  : POSIXct[1:204397], format: "2016-11-29" "2016-11-29" ...
##  $ UniqueComplaintId                          : num [1:204397] 11 18 18 18 18 18 18 18 18 18 ...
##  $ Close Year                                 : num [1:204397] 2006 2006 2006 2006 2006 ...
##  $ Received Year                              : num [1:204397] 2005 2004 2004 2004 2004 ...
##  $ Borough of Occurrence                      : chr [1:204397] "Manhattan" "Brooklyn" "Brooklyn" "Brooklyn" ...
##  $ Is Full Investigation                      : logi [1:204397] FALSE TRUE TRUE TRUE TRUE TRUE ...
##  $ Complaint Has Video Evidence               : logi [1:204397] FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Complaint Filed Mode                       : chr [1:204397] "On-line website" "Phone" "Phone" "Phone" ...
##  $ Complaint Filed Place                      : chr [1:204397] "CCRB" "CCRB" "CCRB" "CCRB" ...
##  $ Complaint Contains Stop & Frisk Allegations: logi [1:204397] FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Incident Location                          : chr [1:204397] "Street/highway" "Street/highway" "Street/highway" "Street/highway" ...
##  $ Incident Year                              : num [1:204397] 2005 2004 2004 2004 2004 ...
##  $ Encounter Outcome                          : chr [1:204397] "No Arrest or Summons" "Arrest" "Arrest" "Arrest" ...
##  $ Reason For Initial Contact                 : chr [1:204397] "Other" "PD suspected C/V of violation/crime - street" "PD suspected C/V of violation/crime - street" "PD suspected C/V of violation/crime - street" ...
##  $ Allegation FADO Type                       : chr [1:204397] "Abuse of Authority" "Abuse of Authority" "Discourtesy" "Discourtesy" ...
##  $ Allegation Description                     : chr [1:204397] "Threat of arrest" "Refusal to obtain medical treatment" "Word" "Word" ...
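
In the `str()` output, `UniqueComplaintId` 18 repeats across several rows with different allegation descriptions, and the table has 204,397 rows while the largest ID is 69,492. This suggests one row per allegation rather than one row per complaint. The sketch below shows how row counts and complaint counts can differ, and why the visualizations de-duplicate before counting. The data frame `toy`, its shortened column names, and its values are made up for illustration and are not taken from the CCRB file.

```r
# Hypothetical toy data mimicking the allegation-level layout of data1.
toy <- data.frame(
  UniqueComplaintId = c(11, 18, 18, 18, 25),
  Borough           = c("Manhattan", "Brooklyn", "Brooklyn", "Brooklyn", "Bronx"),
  FADO              = c("Abuse of Authority", "Abuse of Authority",
                        "Discourtesy", "Discourtesy", "Force"),
  stringsAsFactors  = FALSE
)

n_rows       <- nrow(toy)                              # allegation-level rows
n_complaints <- length(unique(toy$UniqueComplaintId))  # distinct complaints

# Keep one row per complaint/borough pair before counting complaints.
per_complaint <- unique(toy[c("UniqueComplaintId", "Borough")])

print(c(rows = n_rows, complaints = n_complaints))
print(table(per_complaint$Borough))
```

Counting rows directly would credit Brooklyn with three cases here, while the de-duplicated table credits it with one.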

Deliverable and Grades

For this assignment, you should submit a link to a knitr-rendered HTML document that shows your exploratory data analysis. Organize your analysis using section headings:

Visualization 1

Vis1<- unique(data1[c("UniqueComplaintId","Allegation FADO Type")])
Vis1<- data.frame(Vis1)
ggplot(Vis1,aes(Allegation.FADO.Type)) +  geom_bar(position="dodge") +  ggtitle('Figure 1: Number of cases by Allegation FADO Types') +   xlab('Allegation FADO Types') +   ylab('Number of Cases') 

Observation: It can be seen that Abuse of Authority is the most common allegation type.
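
A bar chart shows the ranking, but a statement about which type is most common is easier to check with counts and percentages. The sketch below tabulates a de-duplicated complaint/type table in base R. `toy_vis1`, its column name `FADO`, and its values are hypothetical stand-ins for `Vis1`.

```r
# Hypothetical complaint/type pairs, already de-duplicated like Vis1.
toy_vis1 <- data.frame(
  UniqueComplaintId = c(1, 1, 2, 3, 4, 4),
  FADO = c("Abuse of Authority", "Force", "Abuse of Authority",
           "Discourtesy", "Abuse of Authority", "Offensive Language"),
  stringsAsFactors = FALSE
)

# Count each type, largest first, and convert the counts to percentages.
counts <- sort(table(toy_vis1$FADO), decreasing = TRUE)
shares <- round(100 * prop.table(counts), 1)

print(data.frame(type    = names(counts),
                 n       = as.vector(counts),
                 percent = as.vector(shares)))
```

Because one complaint can carry several allegation types, the percentages are shares of complaint/type pairs, not shares of complaints.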

Visualization 2

Vis2<- unique(data1[c("UniqueComplaintId","Encounter Outcome")])
Vis2<- data.frame(Vis2)
ggplot(Vis2,aes(Encounter.Outcome)) +  geom_bar(position="dodge") +  ggtitle('Figure 2: Number of Cases by Encounter Outcome') +   xlab('Encounter Outcome') +   ylab('Number of Cases')

Observation: It can be seen that the most common encounter outcome is No Arrest or Summons.

Visualization 3

Vis3<- unique(data1[c("UniqueComplaintId","Incident Year")])
Vis3<- data.frame(Vis3)
ggplot(Vis3,aes(Incident.Year)) +  geom_bar() +  ggtitle('Figure 3: Number of Cases by Incident Year') +   xlab('Incident Year') +   ylab('Number of Cases')+coord_flip()

Observation: It can be seen that the number of cases peaked during the years 2005 to 2010 and then gradually declined.
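
The peak years can also be read from a per-year count table instead of the chart. The sketch below is a minimal base R version. `toy_vis3` and its values are hypothetical and only mimic the shape of `Vis3`.

```r
# Hypothetical complaint/year pairs standing in for Vis3.
toy_vis3 <- data.frame(
  UniqueComplaintId = 1:10,
  Incident.Year = c(2005, 2006, 2006, 2007, 2007, 2007, 2008, 2008, 2012, 2015)
)

# Tabulate complaints per year and turn the table into a plain data frame.
per_year <- as.data.frame(table(toy_vis3$Incident.Year), stringsAsFactors = FALSE)
names(per_year) <- c("year", "n")
per_year$year <- as.integer(per_year$year)

# Year with the largest number of complaints.
peak_year <- per_year$year[which.max(per_year$n)]

print(per_year)
print(peak_year)
```

One caveat when reading the real data: the summary shows Close Year starting at 2006 while Incident Year starts at 1999, so counts for the earliest incident years may be incomplete. That is an inference from the summary output, not something confirmed against the metadata tab.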

Visualization 4

Vis4<- unique(data1[c("UniqueComplaintId","Complaint Filed Place")])
Vis4<- data.frame(Vis4)
ggplot(Vis4,aes(Complaint.Filed.Place)) +  geom_bar(position="dodge") +  ggtitle('Figure 4: Number of cases by Complaints filed location') +   xlab('Complaints filed location') +   ylab('Number of Cases')+coord_flip()

Observation: It can be seen that most complaints were filed with the CCRB and the IAB.

Visualization 5

Vis5<- unique(data1[c("UniqueComplaintId","Complaint Filed Mode")])
Vis5<- data.frame(Vis5)
ggplot(Vis5,aes(Complaint.Filed.Mode)) +  geom_bar(position="dodge") +  ggtitle('Figure 5: Number of Cases by complaint filed mode') +   xlab('complaint filed mode') +   ylab('Number of Cases')

Observation: It can be seen that most of the complaints were filed by phone and by call processing system email.

Visualization 6

Vis6<- unique(data1[c("UniqueComplaintId","Complaint Has Video Evidence")])
Vis6<- data.frame(Vis6)
ggplot(Vis6,aes(Complaint.Has.Video.Evidence)) +  geom_bar(position="dodge") +  ggtitle('Figure 6: Number of cases having video evidence') +   xlab('Video Evidence (FALSE = No, TRUE = Yes)') +   ylab('Number of Cases')

Observation: It can be seen that most of the cases do not have video evidence.

Visualization 7

Vis7<- unique(data1[c("UniqueComplaintId","Borough of Occurrence")])
Vis7<- data.frame(Vis7)
ggplot(Vis7,aes(Borough.of.Occurrence)) +  geom_bar(position="dodge") +  ggtitle('Figure 7: Number of cases by borough') +   xlab('Borough of Occurrence') +   ylab('Number of Cases')+coord_flip()

Observation: It can be seen that most of the cases are from Brooklyn.
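
By default `geom_bar()` orders categories alphabetically, so the largest bar is not necessarily first. One possible refinement, sketched below, is to order the factor levels by count so the ranking is visible at a glance. `toy_vis7` and its values are hypothetical and only imitate `Vis7`.

```r
library(ggplot2)

# Hypothetical complaint/borough pairs standing in for Vis7.
toy_vis7 <- data.frame(
  UniqueComplaintId = 1:7,
  Borough.of.Occurrence = c("Brooklyn", "Bronx", "Brooklyn", "Queens",
                            "Brooklyn", "Bronx", "Manhattan"),
  stringsAsFactors = FALSE
)

# Order the levels by ascending count; after coord_flip() the last level
# (the most frequent borough) is drawn at the top of the chart.
borough_counts <- sort(table(toy_vis7$Borough.of.Occurrence))
toy_vis7$Borough.of.Occurrence <- factor(toy_vis7$Borough.of.Occurrence,
                                         levels = names(borough_counts))

p <- ggplot(toy_vis7, aes(Borough.of.Occurrence)) +
  geom_bar() +
  xlab("Borough of Occurrence") +
  ylab("Number of Cases") +
  coord_flip()
print(p)
```

The same reordering would apply to the other categorical charts that use `coord_flip()`.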

Visualization 8

Vis8<- unique(data1[c("UniqueComplaintId","Received Year")])
Vis8<- data.frame(Vis8)
ggplot(Vis8,aes(Received.Year)) +  geom_bar(position="dodge") +  ggtitle('Figure 8: Number of Complaints by Received Year') +   xlab('Received Year') +   ylab('Number of Cases')

Observation: It can be seen that from 2010 the number of cases received has shown a decreasing pattern.

Visualization 9

Vis9<- unique(data1[c("UniqueComplaintId","Is Full Investigation")])
Vis9<- data.frame(Vis9)
ggplot(Vis9,aes(Is.Full.Investigation)) +  geom_bar(position="dodge") +  ggtitle('Figure 9: Number of Cases by investigation') +   xlab('Full Investigation (FALSE = No, TRUE = Yes)') +   ylab('Number of Cases')

Observation: It can be seen that more than 2/3 of the cases did not receive a full investigation.
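
A fraction read off a bar chart can be checked numerically, because the mean of a logical column is the share of `TRUE` values. The sketch below uses a hypothetical `toy_vis9` in place of `Vis9`. Reading `TRUE` as "the complaint received a full investigation" is an assumption based on the column name, not on the metadata tab.

```r
# Hypothetical complaint-level flags standing in for Vis9.
toy_vis9 <- data.frame(
  UniqueComplaintId = 1:6,
  Is.Full.Investigation = c(TRUE, FALSE, FALSE, TRUE, FALSE, FALSE)
)

# mean() of a logical vector gives the proportion of TRUE values.
share_full     <- mean(toy_vis9$Is.Full.Investigation)
share_not_full <- 1 - share_full

print(round(100 * c(full = share_full, not_full = share_not_full), 1))
```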

Visualization 10

Vis10<- unique(data1[c("UniqueComplaintId","Incident Location")])
Vis10<- data.frame(Vis10)
ggplot(Vis10,aes(Incident.Location)) +  geom_bar(position="dodge") +  ggtitle('Figure 10: Number of Cases by location') +   xlab('Incident Location') +   ylab('Number of Cases')+coord_flip()

Observation: It can be seen that most of the cases occurred at Street/highway and Apartment/house locations.

Summary

Some conclusions:

Observation 1: Abuse of Authority is the most common allegation type.
Observation 2: The most common encounter outcome is No Arrest or Summons.
Observation 3: The number of cases peaked during the years 2005 to 2010 and then gradually declined.
Observation 4: Most complaints were filed with the CCRB and the IAB.
Observation 5: Most of the complaints were filed by phone and by call processing system email.
Observation 6: Most of the cases do not have video evidence.
Observation 7: Most of the cases are from Brooklyn.
Observation 8: From 2010 the number of cases received has shown a decreasing pattern.
Observation 9: More than 2/3 of the cases did not receive a full investigation.
Observation 10: Most of the cases occurred at Street/highway and Apartment/house locations.
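
The ten visualizations repeat the same steps: keep one row per complaint and category, then draw a bar chart. As a possible refactoring, the sketch below wraps those steps in a helper. The function `plot_complaint_counts()` and the data frame `toy` are hypothetical names introduced here and are not part of the original analysis.

```r
library(ggplot2)

# Hypothetical helper: count each complaint once per category and plot the counts.
plot_complaint_counts <- function(data, column, title, flip = FALSE) {
  # One row per complaint/category pair, as in the Vis1..Vis10 steps.
  pairs <- unique(data[c("UniqueComplaintId", column)])
  # .data[[column]] lets aes() use a column name that contains spaces.
  p <- ggplot(pairs, aes(x = .data[[column]])) +
    geom_bar() +
    ggtitle(title) +
    xlab(column) +
    ylab("Number of Cases")
  if (flip) p <- p + coord_flip()
  p
}

# Made-up rows with the same column naming style as data1.
toy <- data.frame(
  UniqueComplaintId   = c(11, 18, 18, 25),
  `Encounter Outcome` = c("No Arrest or Summons", "Arrest", "Arrest", "Summons"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

p <- plot_complaint_counts(toy, "Encounter Outcome",
                           "Number of Cases by Encounter Outcome")
print(p)
```

Passing the original column name as a string avoids the `data.frame()` conversion, which rewrites names such as "Encounter Outcome" to "Encounter.Outcome".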