Directions

The objective of this assignment is to conduct an exploratory data analysis of a data set that you are not familiar with. In this week's lecture we discussed a number of visualization approaches to exploring a data set; this assignment applies those tools and techniques. An important distinction between class examples and applied data science work is the iterative and repetitive nature of exploring a data set. It takes time to understand what is in the data and what is interesting in it.

For this week we will be exploring data from the NYC Data Transparency Initiative. They maintain a database of complaints that fall within the jurisdiction of the Civilian Complaint Review Board (CCRB), an independent municipal agency. Your objective is to identify interesting patterns and trends within the data that may be indicative of large-scale trends.

This link will allow you to download the data set in .xlsx format. The data file has two tabs: one with metadata, and the “Complaints_Allegations” tab with the actual data.

Import Library

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.2.5
library(ggthemes)

Read Data

library(readxl)
## Warning: package 'readxl' was built under R version 3.2.5
data1 <- read_excel("/Users/Yihan/Downloads/ccrb_datatransparencyinitiative.xlsx", 
    sheet = "Complaints_Allegations")

summary(data1)
##    DateStamp          UniqueComplaintId   Close Year   Received Year 
##  Min.   :2016-11-29   Min.   :    1     Min.   :2006   Min.   :1999  
##  1st Qu.:2016-11-29   1st Qu.:17356     1st Qu.:2008   1st Qu.:2007  
##  Median :2016-11-29   Median :34794     Median :2010   Median :2009  
##  Mean   :2016-11-29   Mean   :34778     Mean   :2010   Mean   :2010  
##  3rd Qu.:2016-11-29   3rd Qu.:52204     3rd Qu.:2013   3rd Qu.:2012  
##  Max.   :2016-11-29   Max.   :69492     Max.   :2016   Max.   :2016  
##  Borough of Occurrence Is Full Investigation Complaint Has Video Evidence
##  Length:204397         Mode :logical         Mode :logical               
##  Class :character      FALSE:107084          FALSE:195530                
##  Mode  :character      TRUE :97313           TRUE :8867                  
##                        NA's :0               NA's :0                     
##                                                                          
##                                                                          
##  Complaint Filed Mode Complaint Filed Place
##  Length:204397        Length:204397        
##  Class :character     Class :character     
##  Mode  :character     Mode  :character     
##                                            
##                                            
##                                            
##  Complaint Contains Stop & Frisk Allegations Incident Location 
##  Mode :logical                               Length:204397     
##  FALSE:119856                                Class :character  
##  TRUE :84541                                 Mode  :character  
##  NA's :0                                                       
##                                                                
##                                                                
##  Incident Year  Encounter Outcome  Reason For Initial Contact
##  Min.   :1999   Length:204397      Length:204397             
##  1st Qu.:2007   Class :character   Class :character          
##  Median :2009   Mode  :character   Mode  :character          
##  Mean   :2010                                                
##  3rd Qu.:2012                                                
##  Max.   :2016                                                
##  Allegation FADO Type Allegation Description
##  Length:204397        Length:204397         
##  Class :character     Class :character      
##  Mode  :character     Mode  :character      
##                                             
##                                             
## 
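
The summary reports 204,397 rows but a maximum UniqueComplaintId of 69,492, which suggests that each row is one allegation and that a single complaint can span several rows. The charts that follow therefore keep one row per complaint before counting. A minimal sketch of that step, using a made-up data frame (the toy values are assumptions for illustration only, not CCRB data):

```r
# Toy allegation-level data: complaint 1 appears twice (two allegations).
toy <- data.frame(
  UniqueComplaintId = c(1, 1, 2, 3),
  Incident.Year     = c(2010, 2010, 2011, 2011)
)

# Keep one row per distinct (complaint, year) pair.
per_complaint <- unique(toy[c("UniqueComplaintId", "Incident.Year")])

nrow(toy)            # number of allegation rows
nrow(per_complaint)  # number of complaint rows
table(per_complaint$Incident.Year)
```

Counting the deduplicated rows gives cases per year; counting the raw rows would give allegations per year instead.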

Vis1: Illustration by Incident Year

Vis1<- unique(data1[c("UniqueComplaintId","Incident Year")])
Vis1<- data.frame(Vis1)
ggplot(Vis1,aes(Incident.Year)) +  geom_bar() +  ggtitle('Graph 1: Cases by Incident Year') +   xlab('Incident Year') +   ylab('Number of Cases')
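
The plot refers to Incident.Year even though the spreadsheet column is named "Incident Year". This works because data.frame() by default passes column names through make.names(), which replaces spaces with dots. A small base R sketch of that behaviour, where a made-up one-element list stands in for the table returned by read_excel():

```r
# Column names as they appear in the spreadsheet.
raw_names <- c("UniqueComplaintId", "Incident Year", "Borough of Occurrence")

# make.names() turns them into syntactically valid R names.
make.names(raw_names)

# data.frame() applies the same conversion unless check.names = FALSE.
toy <- list("Incident Year" = 2010)   # made-up stand-in for the imported data
converted <- data.frame(toy)
kept      <- data.frame(toy, check.names = FALSE)
names(converted)
names(kept)
```

With check.names = FALSE the original names survive, and a column would then have to be referenced with backticks inside aes().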

Vis2: Illustration by Received Year

Vis2<- unique(data1[c("UniqueComplaintId","Received Year")])
Vis2<- data.frame(Vis2)
ggplot(Vis2,aes(Received.Year)) +  geom_bar(position="dodge") +  ggtitle('Graph 2: Cases by Received Year') +   xlab('Received Year') +   ylab('Number of Cases')

Vis3: Illustration by Borough of Occurrence

Vis3<- unique(data1[c("UniqueComplaintId","Borough of Occurrence")])
Vis3<- data.frame(Vis3)
ggplot(Vis3,aes(Borough.of.Occurrence)) +  geom_bar() +  ggtitle('Graph 3: Cases by Borough') +   xlab('Borough') + ylab('Number of Cases') 

Vis4: Illustration by Full Investigation

Vis4<- unique(data1[c("UniqueComplaintId","Is Full Investigation")])
Vis4<- data.frame(Vis4)
ggplot(Vis4,aes(Is.Full.Investigation)) +  geom_bar(position="dodge") +  ggtitle('Graph 4: Cases by investigation') +   xlab('Full Investigation (FALSE = No, TRUE = Yes)') +   ylab('Number of Cases')

Vis5: Illustration by Video Evidence

Vis5<- unique(data1[c("UniqueComplaintId","Complaint Has Video Evidence")])
Vis5<- data.frame(Vis5)
ggplot(Vis5,aes(Complaint.Has.Video.Evidence)) +  geom_bar(position="dodge") +  ggtitle('Graph 5: Cases having video evidence') +   xlab('Video Evidence (FALSE = No, TRUE = Yes)') +   ylab('Number of Cases')

Vis6: Illustration by Complaint Filed Place

Vis6<- unique(data1[c("UniqueComplaintId","Complaint Filed Place")])
Vis6<- data.frame(Vis6)
ggplot(Vis6,aes(Complaint.Filed.Place)) +  geom_bar(position="dodge") +  ggtitle('Graph 6: Cases by Complaints filed location') +   xlab('Complaints filed location') +   ylab('Number of Cases')+coord_flip()

Vis7: Illustration by Complaint Filed Mode

Vis7<- unique(data1[c("UniqueComplaintId","Complaint Filed Mode")])
Vis7<- data.frame(Vis7)
ggplot(Vis7,aes(Complaint.Filed.Mode)) +  geom_bar(position="dodge") +  ggtitle('Graph 7: Cases by Complaints filed mode') +   xlab('Complaints filed mode') +   ylab('Number of Cases')+coord_flip()

Vis8: Illustration by Incident Location

Vis8<- unique(data1[c("UniqueComplaintId","Incident Location")])
Vis8<- data.frame(Vis8)
ggplot(Vis8,aes(Incident.Location)) +  geom_bar(position="dodge") +  ggtitle('Graph 8: Cases by location') +   xlab('Incident Location') +   ylab('Number of Cases')+coord_flip()

Vis9: Illustration by Encounter Outcome

Vis9<- unique(data1[c("UniqueComplaintId","Encounter Outcome")])
Vis9<- data.frame(Vis9)
ggplot(Vis9,aes(Encounter.Outcome)) +  geom_bar(position="dodge") +  ggtitle('Graph 9: Cases by Encounter Outcome') +   xlab('Encounter Outcome') +   ylab('Number of Cases')

Vis10: Illustration by Allegation FADO Type

Vis10<- unique(data1[c("UniqueComplaintId","Allegation FADO Type")])
Vis10<- data.frame(Vis10)
ggplot(Vis10,aes(Allegation.FADO.Type)) +  geom_bar(position="dodge") +  ggtitle('Graph 10: Cases by Allegation FADO Types') +   xlab('Allegation FADO Types') +   ylab('Number of Cases')
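
Each chart repeats the same two steps: keep one row per complaint for a given column, then count complaints per value. A hedged sketch of a helper that wraps those steps in base R follows; the function name count_cases and the toy data are hypothetical and not part of the original analysis:

```r
# Hypothetical helper: number of distinct complaints per value of `column`.
count_cases <- function(data, column, id = "UniqueComplaintId") {
  per_complaint <- unique(data[c(id, column)])
  counts <- table(per_complaint[[column]], useNA = "ifany")
  sort(counts, decreasing = TRUE)
}

# Made-up allegation-level rows for illustration.
toy <- data.frame(
  UniqueComplaintId = c(1, 1, 2, 3, 4),
  Borough = c("Brooklyn", "Brooklyn", "Bronx", "Brooklyn", "Queens"),
  stringsAsFactors = FALSE
)
borough_counts <- count_cases(toy, "Borough")
borough_counts
```

On the real data, a call such as count_cases(data1, "Borough of Occurrence") should give the numbers behind Graph 3, assuming the column names shown in the summary output.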

Summary: The first graph shows that the largest number of complaints stems from incidents in 2006-2007, with a decreasing trend in recent years. Most encounters ended with no arrest or summons, and most allegations fall under Abuse of Authority. Brooklyn is the borough with the largest number of cases. Because most complaints relate to abuse of authority and force, video evidence for such complaints should be increased in order to bring their number down. Similarly, although more complaints are being fully investigated than in previous years, this level can still be improved.
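
The charts do not show how the share of fully investigated complaints changes over time, so that last claim is worth checking directly. A minimal sketch on made-up data, assuming the column names as converted by data.frame(); the toy values are not results from the CCRB data:

```r
# Made-up allegation-level rows; complaint 1 has two allegations.
toy <- data.frame(
  UniqueComplaintId     = c(1, 1, 2, 3, 4, 5),
  Received.Year         = c(2014, 2014, 2014, 2015, 2015, 2015),
  Is.Full.Investigation = c(TRUE, TRUE, FALSE, TRUE, TRUE, FALSE)
)

# One row per complaint; the mean of a logical vector is the share of TRUE.
per_complaint <- unique(toy[c("UniqueComplaintId", "Received.Year",
                              "Is.Full.Investigation")])
share_full <- tapply(per_complaint$Is.Full.Investigation,
                     per_complaint$Received.Year, mean)
share_full
```

Plotting such a share per year, rather than raw counts, separates a change in investigation practice from a change in the overall number of complaints.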