The objective of this assignment is to conduct an exploratory data analysis of a data set that you are not familiar with. In this week’s lecture, we discussed a number of visualization approaches for exploring a data set. This assignment applies those tools and techniques. An important distinction between class examples and applied data science work is the iterative and repetitive nature of exploring a data set. It takes time to understand what the data is and what is interesting about it (patterns).
For this week, we will be exploring data from the NYC Data Transparency Initiative. They maintain a database of complaints that fall within the jurisdiction of the Civilian Complaint Review Board (CCRB), an independent municipal agency. Your objective is to identify interesting patterns within the data that may be indicative of large-scale trends.
This link will allow you to download the data set in .xlsx format. The data file has two tabs: one with metadata, and the “Complaints_Allegations” tab with the actual data.
library(ggplot2)
library(ggthemes)
library(readxl)
data1 <- read_excel("/Users/RunhaoWang/Desktop/512 O/ccrb_datatransparencyinitiative.xlsx",sheet = "Complaints_Allegations")
summary(data1)
## DateStamp UniqueComplaintId Close Year Received Year
## Min. :2016-11-29 Min. : 1 Min. :2006 Min. :1999
## 1st Qu.:2016-11-29 1st Qu.:17356 1st Qu.:2008 1st Qu.:2007
## Median :2016-11-29 Median :34794 Median :2010 Median :2009
## Mean :2016-11-29 Mean :34778 Mean :2010 Mean :2010
## 3rd Qu.:2016-11-29 3rd Qu.:52204 3rd Qu.:2013 3rd Qu.:2012
## Max. :2016-11-29 Max. :69492 Max. :2016 Max. :2016
## Borough of Occurrence Is Full Investigation Complaint Has Video Evidence
## Length:204397 Mode :logical Mode :logical
## Class :character FALSE:107084 FALSE:195530
## Mode :character TRUE :97313 TRUE :8867
##
##
##
## Complaint Filed Mode Complaint Filed Place
## Length:204397 Length:204397
## Class :character Class :character
## Mode :character Mode :character
##
##
##
## Complaint Contains Stop & Frisk Allegations Incident Location Incident Year
## Mode :logical Length:204397 Min. :1999
## FALSE:119856 Class :character 1st Qu.:2007
## TRUE :84541 Mode :character Median :2009
## Mean :2010
## 3rd Qu.:2012
## Max. :2016
## Encounter Outcome Reason For Initial Contact Allegation FADO Type
## Length:204397 Length:204397 Length:204397
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## Allegation Description
## Length:204397
## Class :character
## Mode :character
##
##
##
str(data1)
## tibble [204,397 × 16] (S3: tbl_df/tbl/data.frame)
## $ DateStamp : POSIXct[1:204397], format: "2016-11-29" "2016-11-29" ...
## $ UniqueComplaintId : num [1:204397] 11 18 18 18 18 18 18 18 18 18 ...
## $ Close Year : num [1:204397] 2006 2006 2006 2006 2006 ...
## $ Received Year : num [1:204397] 2005 2004 2004 2004 2004 ...
## $ Borough of Occurrence : chr [1:204397] "Manhattan" "Brooklyn" "Brooklyn" "Brooklyn" ...
## $ Is Full Investigation : logi [1:204397] FALSE TRUE TRUE TRUE TRUE TRUE ...
## $ Complaint Has Video Evidence : logi [1:204397] FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ Complaint Filed Mode : chr [1:204397] "On-line website" "Phone" "Phone" "Phone" ...
## $ Complaint Filed Place : chr [1:204397] "CCRB" "CCRB" "CCRB" "CCRB" ...
## $ Complaint Contains Stop & Frisk Allegations: logi [1:204397] FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ Incident Location : chr [1:204397] "Street/highway" "Street/highway" "Street/highway" "Street/highway" ...
## $ Incident Year : num [1:204397] 2005 2004 2004 2004 2004 ...
## $ Encounter Outcome : chr [1:204397] "No Arrest or Summons" "Arrest" "Arrest" "Arrest" ...
## $ Reason For Initial Contact : chr [1:204397] "Other" "PD suspected C/V of violation/crime - street" "PD suspected C/V of violation/crime - street" "PD suspected C/V of violation/crime - street" ...
## $ Allegation FADO Type : chr [1:204397] "Abuse of Authority" "Abuse of Authority" "Discourtesy" "Discourtesy" ...
## $ Allegation Description : chr [1:204397] "Threat of arrest" "Refusal to obtain medical treatment" "Word" "Word" ...
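The column names printed by str() contain spaces and, in one case, an ampersand, so they are not syntactic R names. The sketch below shows how base R's make.names() converts such names, and that wrapping a data frame in data.frame() applies the same conversion by default (check.names = TRUE). The data frame `toy` and its values are made up for illustration; only the two column names are taken from the output above.

```r
# Toy stand-in for data1 with two of the original column names (values are made up).
toy <- data.frame(
  `Allegation FADO Type` = c("Force", "Discourtesy"),
  `Complaint Contains Stop & Frisk Allegations` = c(TRUE, FALSE),
  check.names = FALSE  # keep the names exactly as written
)

# make.names() replaces each invalid character (space, &) with a dot.
syntactic <- make.names(names(toy))
print(syntactic)

# Wrapping an existing data frame in data.frame() applies the same conversion.
converted <- data.frame(toy)
print(names(converted))
```

This is why a column read as "Allegation FADO Type" can later be referred to as Allegation.FADO.Type once the data has passed through data.frame().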
For this assignment, you should submit a link to a knitr-rendered HTML document that shows your exploratory data analysis. Organize your analysis using section headings:
Vis1<- unique(data1[c("UniqueComplaintId","Allegation FADO Type")])
Vis1<- data.frame(Vis1)
ggplot(Vis1,aes(Allegation.FADO.Type)) + geom_bar(position="dodge") + ggtitle('Figure 1: Number of cases by Allegation FADO Types') + xlab('Allegation FADO Types') + ylab('Number of Cases')
Observation: Abuse of Authority is the most common allegation type across complaints.
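The data has one row per allegation (204,397 rows, while the largest UniqueComplaintId is 69,492), so a complaint with several allegations appears on several rows. The sketch below, on made-up toy data, illustrates why the code deduplicates on the (UniqueComplaintId, Allegation FADO Type) pair before counting. The values in `toy` are assumptions for illustration only.

```r
# Made-up toy data: complaint 18 has three allegations, two of the same type.
toy <- data.frame(
  UniqueComplaintId = c(11, 18, 18, 18, 25),
  Allegation.FADO.Type = c("Abuse of Authority", "Abuse of Authority",
                           "Discourtesy", "Discourtesy", "Force")
)

# Counting raw rows counts allegations, not complaints.
allegation_counts <- table(toy$Allegation.FADO.Type)

# Keeping one row per (complaint, type) pair counts complaints per type.
pairs <- unique(toy[c("UniqueComplaintId", "Allegation.FADO.Type")])
complaint_counts <- table(pairs$Allegation.FADO.Type)

print(allegation_counts)
print(complaint_counts)
```

Note that a complaint with two different allegation types is still counted once under each type, so the bars in Figure 1 can sum to more than the number of distinct complaints.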
Vis2<- unique(data1[c("UniqueComplaintId","Encounter Outcome")])
Vis2<- data.frame(Vis2)
ggplot(Vis2,aes(Encounter.Outcome)) + geom_bar(position="dodge") + ggtitle('Figure 2: Number of Cases by Encounter Outcome') + xlab('Encounter Outcome') + ylab('Number of Cases')
Observation: The most common encounter outcome is "No Arrest or Summons".
Vis3<- unique(data1[c("UniqueComplaintId","Incident Year")])
Vis3<- data.frame(Vis3)
ggplot(Vis3,aes(Incident.Year)) + geom_bar() + ggtitle('Figure 3: Number of Cases by Incident Year') + xlab('Incident Year') + ylab('Number of Cases')+coord_flip()
Observation: The number of cases peaked between 2005 and 2010 and then gradually declined.
Vis4<- unique(data1[c("UniqueComplaintId","Complaint Filed Place")])
Vis4<- data.frame(Vis4)
ggplot(Vis4,aes(Complaint.Filed.Place)) + geom_bar(position="dodge") + ggtitle('Figure 4: Number of cases by Complaints filed location') + xlab('Complaints filed location') + ylab('Number of Cases')+coord_flip()
Observation: Most complaints were filed with the CCRB or the IAB.
Vis5<- unique(data1[c("UniqueComplaintId","Complaint Filed Mode")])
Vis5<- data.frame(Vis5)
ggplot(Vis5,aes(Complaint.Filed.Mode)) + geom_bar(position="dodge") + ggtitle('Figure 5: Number of Cases by complaint filed mode') + xlab('complaint filed mode') + ylab('Number of Cases')
Observation: Most complaints were filed by phone and by Call processing system email.
Vis6<- unique(data1[c("UniqueComplaintId","Complaint Has Video Evidence")])
Vis6<- data.frame(Vis6)
ggplot(Vis6,aes(Complaint.Has.Video.Evidence)) + geom_bar(position="dodge") + ggtitle('Figure 6: Number of cases having video evidence') + xlab('Video Evidence (FALSE = No, TRUE = Yes)') + ylab('Number of Cases')
Observation: Most cases do not have video evidence.
Vis7<- unique(data1[c("UniqueComplaintId","Borough of Occurrence")])
Vis7<- data.frame(Vis7)
ggplot(Vis7,aes(Borough.of.Occurrence)) + geom_bar(position="dodge") + ggtitle('Figure 7: Number of cases by borough') + xlab('Borough of Occurrence') + ylab('Number of Cases')+coord_flip()
Observation: Brooklyn accounts for the largest number of cases.
Vis8<- unique(data1[c("UniqueComplaintId","Received Year")])
Vis8<- data.frame(Vis8)
ggplot(Vis8,aes(Received.Year)) + geom_bar(position="dodge") + ggtitle('Figure 8: Number of Complaints by Received Year') + xlab('Received Year') + ylab('Number of Cases')
Observation: The number of cases received has declined since 2010.
Vis9<- unique(data1[c("UniqueComplaintId","Is Full Investigation")])
Vis9<- data.frame(Vis9)
ggplot(Vis9,aes(Is.Full.Investigation)) + geom_bar(position="dodge") + ggtitle('Figure 9: Number of Cases by investigation') + xlab('Full Investigation (FALSE = No, TRUE = Yes)') + ylab('Number of Cases')
Observation: More than 2/3 of the cases did not receive a full investigation. Every record has a Close Year, so these cases are closed rather than still under investigation.
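A share such as "more than 2/3" is easier to check numerically than to read off a bar chart. The sketch below uses made-up toy data and assumes the flag is constant within a complaint; it relies on the fact that mean() of a logical vector gives the proportion of TRUE values.

```r
# Made-up toy data: one row per allegation, flag constant within a complaint.
toy <- data.frame(
  UniqueComplaintId = c(1, 2, 2, 3, 4, 4, 4, 5, 6),
  Is.Full.Investigation = c(FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE, FALSE)
)

# One row per complaint.
complaints <- unique(toy)

# Share of complaints with and without a full investigation.
share_full <- mean(complaints$Is.Full.Investigation)
share_not_full <- 1 - share_full

# The same shares as a table of proportions.
print(prop.table(table(complaints$Is.Full.Investigation)))
```

The row-level and complaint-level shares can differ. In summary(data1), 107,084 of 204,397 allegation rows (about 52%) are FALSE, which is a different quantity from the share of complaints shown in Figure 9.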
Vis10<- unique(data1[c("UniqueComplaintId","Incident Location")])
Vis10<- data.frame(Vis10)
ggplot(Vis10,aes(Incident.Location)) + geom_bar(position="dodge") + ggtitle('Figure 10: Number of Cases by location') + xlab('Incident Location') + ylab('Number of Cases')+coord_flip()
Observation: Most cases occurred on a street/highway or in an apartment/house.
Some Conclusions:
Observation 1: Abuse of Authority is the most common allegation type across complaints.
Observation 2: The most common encounter outcome is "No Arrest or Summons".
Observation 3: The number of cases peaked between 2005 and 2010 and then gradually declined.
Observation 4: Most complaints were filed with the CCRB or the IAB.
Observation 5: Most complaints were filed by phone and by Call processing system email.
Observation 6: Most cases do not have video evidence.
Observation 7: Brooklyn accounts for the largest number of cases.
Observation 8: The number of cases received has declined since 2010.
Observation 9: More than 2/3 of the cases did not receive a full investigation.
Observation 10: Most cases occurred on a street/highway or in an apartment/house.
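The ten figures repeat the same deduplicate-then-count step with a different column each time. One possible way to reduce that repetition is a small helper, sketched below in base R. The function name `count_complaints_by`, the toy data, and the shortened column names `Borough` and `Mode` are all hypothetical and not part of the original analysis.

```r
# Hypothetical helper: count distinct complaints per value of one column.
count_complaints_by <- function(data, column, id = "UniqueComplaintId") {
  pairs <- unique(data[c(id, column)])   # one row per (complaint, value) pair
  counts <- table(pairs[[column]])
  sort(counts, decreasing = TRUE)        # most common value first
}

# Made-up toy data standing in for data1.
toy <- data.frame(
  UniqueComplaintId = c(1, 1, 2, 3, 3, 4),
  Borough = c("Brooklyn", "Brooklyn", "Bronx", "Brooklyn", "Brooklyn", "Queens"),
  Mode = c("Phone", "Phone", "Phone", "E-mail", "E-mail", "Phone")
)

# Apply the helper to several columns at once.
summaries <- lapply(c(Borough = "Borough", Mode = "Mode"),
                    function(col) count_complaints_by(toy, col))
print(summaries)
```

Each element of `summaries` is a sorted table, so the most common value for a column is the first name in the corresponding table.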