The objective of this assignment is to conduct an exploratory data analysis of a data set that you are not familiar with. In this week's lecture we discussed a number of visualization approaches to exploring a data set; this assignment will apply those tools and techniques. An important distinction between class examples and applied data science work is the iterative and repetitive nature of exploring a data set. It takes time to understand what is in the data and what is interesting in the data.
For this week we will be exploring data from the NYC Data Transparency Initiative. They maintain a database of complaints that fall within the jurisdiction of the Civilian Complaint Review Board (CCRB), an independent municipal agency. Your objective is to identify interesting patterns and trends within the data that may be indicative of large-scale trends.
This link will allow you to download the data set in .xlsx format. The data file has two tabs: one with metadata, and the “Complaints_Allegations” tab with the actual data.
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.2.5
library(ggthemes)
Read Data
library(readxl)
## Warning: package 'readxl' was built under R version 3.2.5
data1 <- read_excel("/Users/Yihan/Downloads/ccrb_datatransparencyinitiative.xlsx",
sheet = "Complaints_Allegations")
summary(data1)
## DateStamp UniqueComplaintId Close Year Received Year
## Min. :2016-11-29 Min. : 1 Min. :2006 Min. :1999
## 1st Qu.:2016-11-29 1st Qu.:17356 1st Qu.:2008 1st Qu.:2007
## Median :2016-11-29 Median :34794 Median :2010 Median :2009
## Mean :2016-11-29 Mean :34778 Mean :2010 Mean :2010
## 3rd Qu.:2016-11-29 3rd Qu.:52204 3rd Qu.:2013 3rd Qu.:2012
## Max. :2016-11-29 Max. :69492 Max. :2016 Max. :2016
## Borough of Occurrence Is Full Investigation Complaint Has Video Evidence
## Length:204397 Mode :logical Mode :logical
## Class :character FALSE:107084 FALSE:195530
## Mode :character TRUE :97313 TRUE :8867
## NA's :0 NA's :0
##
##
## Complaint Filed Mode Complaint Filed Place
## Length:204397 Length:204397
## Class :character Class :character
## Mode :character Mode :character
##
##
##
## Complaint Contains Stop & Frisk Allegations Incident Location
## Mode :logical Length:204397
## FALSE:119856 Class :character
## TRUE :84541 Mode :character
## NA's :0
##
##
## Incident Year Encounter Outcome Reason For Initial Contact
## Min. :1999 Length:204397 Length:204397
## 1st Qu.:2007 Class :character Class :character
## Median :2009 Mode :character Mode :character
## Mean :2010
## 3rd Qu.:2012
## Max. :2016
## Allegation FADO Type Allegation Description
## Length:204397 Length:204397
## Class :character Class :character
## Mode :character Mode :character
##
##
##
Vis1: Illustration by Incident Year
Vis1 <- unique(data1[c("UniqueComplaintId", "Incident Year")])
Vis1 <- data.frame(Vis1)
ggplot(Vis1, aes(Incident.Year)) + geom_bar() + ggtitle('Graph 1: Cases by Incident Year') + xlab('Incident Year') + ylab('Number of Cases')
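The `unique()` step matters because the sheet appears to hold one row per allegation rather than one per complaint: the summary reports 204,397 rows, while `UniqueComplaintId` only goes up to 69,492. The sketch below illustrates the effect on a small made-up data frame (`toy` and its values are hypothetical, not taken from the CCRB file). Counting rows directly would overweight complaints that carry several allegations.

```r
# Hypothetical toy data: complaint 101 has three allegations,
# complaint 102 has one, and complaint 103 has two.
toy <- data.frame(
  UniqueComplaintId = c(101, 101, 101, 102, 103, 103),
  Incident.Year     = c(2006, 2006, 2006, 2007, 2007, 2007)
)

n_rows       <- nrow(toy)                             # allegation-level rows
n_complaints <- length(unique(toy$UniqueComplaintId)) # distinct complaints

# One row per (complaint, year) pair, as in the Vis1 code
per_complaint <- unique(toy)

# Counts per year before and after de-duplication
rows_by_year  <- table(toy$Incident.Year)            # 2006: 3, 2007: 3
cases_by_year <- table(per_complaint$Incident.Year)  # 2006: 1, 2007: 2

print(rows_by_year)
print(cases_by_year)
```

Running the same comparison on `data1` would be a quick way to check how far the two counts differ in the real data.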
Vis2: Illustration by Received Year
Vis2<- unique(data1[c("UniqueComplaintId","Received Year")])
Vis2<- data.frame(Vis2)
ggplot(Vis2,aes(Received.Year)) + geom_bar(position="dodge") + ggtitle('Graph 2: Cases by Received Year') + xlab('Received Year') + ylab('Number of Cases')
Vis3: Illustration by Borough of Occurrence
Vis3 <- unique(data1[c("UniqueComplaintId", "Borough of Occurrence")])
Vis3 <- data.frame(Vis3)
# Borough is categorical, so geom_bar() is used instead of geom_histogram(stat = "count")
ggplot(Vis3, aes(Borough.of.Occurrence)) + geom_bar() + ggtitle('Graph 3: Cases by Borough') + xlab('Borough') + ylab('Number of Cases')
Vis4: Illustration by Full Investigation
Vis4 <- unique(data1[c("UniqueComplaintId", "Is Full Investigation")])
Vis4 <- data.frame(Vis4)
# The column is logical, so the axis shows FALSE/TRUE rather than 0/1
ggplot(Vis4, aes(Is.Full.Investigation)) + geom_bar() + ggtitle('Graph 4: Cases by Full Investigation') + xlab('Full Investigation (FALSE = No, TRUE = Yes)') + ylab('Number of Cases')
Vis5: Illustration by Video Evidence
Vis5 <- unique(data1[c("UniqueComplaintId", "Complaint Has Video Evidence")])
Vis5 <- data.frame(Vis5)
# The column is logical, so the axis shows FALSE/TRUE rather than 0/1
ggplot(Vis5, aes(Complaint.Has.Video.Evidence)) + geom_bar() + ggtitle('Graph 5: Cases having video evidence') + xlab('Video Evidence (FALSE = No, TRUE = Yes)') + ylab('Number of Cases')
Vis6: Illustration by Complaint Filed Place
Vis6<- unique(data1[c("UniqueComplaintId","Complaint Filed Place")])
Vis6<- data.frame(Vis6)
ggplot(Vis6,aes(Complaint.Filed.Place)) + geom_bar(position="dodge") + ggtitle('Graph 6: Cases by Complaints filed location') + xlab('Complaints filed location') + ylab('Number of Cases')+coord_flip()
Vis7: Illustration by Complaint Filed Mode
Vis7 <- unique(data1[c("UniqueComplaintId", "Complaint Filed Mode")])
Vis7 <- data.frame(Vis7)
ggplot(Vis7, aes(Complaint.Filed.Mode)) + geom_bar() + ggtitle('Graph 7: Cases by Complaint filed mode') + xlab('Complaint filed mode') + ylab('Number of Cases') + coord_flip()
Vis8: Illustration by Incident Location
Vis8<- unique(data1[c("UniqueComplaintId","Incident Location")])
Vis8<- data.frame(Vis8)
ggplot(Vis8,aes(Incident.Location)) + geom_bar(position="dodge") + ggtitle('Graph 8: Cases by location') + xlab('Incident Location') + ylab('Number of Cases')+coord_flip()
Vis9: Illustration by Encounter Outcome
Vis9<- unique(data1[c("UniqueComplaintId","Encounter Outcome")])
Vis9<- data.frame(Vis9)
ggplot(Vis9,aes(Encounter.Outcome)) + geom_bar(position="dodge") + ggtitle('Graph 9: Cases by Encounter Outcome') + xlab('Encounter Outcome') + ylab('Number of Cases')
Vis10: Illustration by Allegation FADO Type
Vis10<- unique(data1[c("UniqueComplaintId","Allegation FADO Type")])
Vis10<- data.frame(Vis10)
ggplot(Vis10,aes(Allegation.FADO.Type)) + geom_bar(position="dodge") + ggtitle('Graph 10: Cases by Allegation FADO Types') + xlab('Allegation FADO Types') + ylab('Number of Cases')
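Every chart above repeats the same three steps: keep one row per complaint and attribute value, count the values, and draw bars. As a minimal sketch, those steps could be wrapped in a small helper. The function name `count_complaints`, its arguments, and the `toy` data frame are hypothetical and only serve as illustration. The helper is not part of ggplot2 or readxl.

```r
library(ggplot2)

# Hypothetical helper: number of unique complaints per value of one column.
# `df` is the allegation-level table, `column` the attribute of interest,
# `id` the complaint identifier column.
count_complaints <- function(df, column, id = "UniqueComplaintId") {
  pairs  <- unique(df[, c(id, column)])  # one row per (complaint, value) pair
  counts <- as.data.frame(table(pairs[[column]]), stringsAsFactors = FALSE)
  names(counts) <- c("value", "cases")
  counts
}

# Made-up example data: complaints 1 and 3 have repeated allegation rows
toy <- data.frame(
  UniqueComplaintId = c(1, 1, 2, 3, 3, 3),
  Borough = c("Brooklyn", "Brooklyn", "Bronx",
              "Brooklyn", "Brooklyn", "Brooklyn"),
  stringsAsFactors = FALSE
)

borough_counts <- count_complaints(toy, "Borough")
print(borough_counts)  # Bronx: 1 case, Brooklyn: 2 cases

# The counts are already aggregated, so geom_col() is used instead of geom_bar()
p <- ggplot(borough_counts, aes(x = value, y = cases)) +
  geom_col() +
  xlab("Borough") +
  ylab("Number of Cases")
# print(p) draws the chart
```

With the real data, a call such as `count_complaints(data1, "Borough of Occurrence")` should give the counts behind Graph 3, assuming the column names match the summary output shown earlier. Note that `table()` drops missing values by default.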
Summary: The first graph shows that the number of complaints peaked for incidents in 2006-2007 and has declined in recent years. The most common encounter outcome is no arrest or summons, and the most common allegation type is Abuse of Authority. Brooklyn is the borough with the largest number of cases. Because most complaints concern abuse of authority and force, collecting video evidence for more of these cases could help substantiate them and may help bring the number of complaints down. Similarly, although more complaints appear to be fully investigated than in previous years, this share can still be improved.
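Graph 4 only shows overall counts, so the question of whether full investigations have become more common over time needs a breakdown by year. The following is a minimal sketch of that calculation on made-up data. The `toy` data frame is hypothetical, and the dotted column names assume the same `data.frame()` conversion used in the code above.

```r
# Hypothetical toy data at allegation level (repeated rows per complaint)
toy <- data.frame(
  UniqueComplaintId     = c(1, 1, 2, 3, 4, 4),
  Received.Year         = c(2014, 2014, 2014, 2015, 2015, 2015),
  Is.Full.Investigation = c(TRUE, TRUE, FALSE, TRUE, TRUE, TRUE)
)

# One row per complaint
complaints <- unique(
  toy[, c("UniqueComplaintId", "Received.Year", "Is.Full.Investigation")]
)

# The mean of a logical column is the share of TRUE values
share_by_year <- aggregate(
  Is.Full.Investigation ~ Received.Year,
  data = complaints,
  FUN  = mean
)
names(share_by_year)[2] <- "share_full"

print(share_by_year)  # 2014: 0.5, 2015: 1.0
```

Applied to the real data, the resulting table could be drawn as a line chart of `share_full` against `Received.Year` to see whether the share actually rises.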