The objective of this assignment is to conduct an exploratory data analysis of a data set that you are not familiar with. In this week's lecture we discussed a number of visualization approaches to exploring a data set; this assignment will apply those tools and techniques. An important distinction between class examples and applied data science work is the iterative and repetitive nature of exploring a data set. It takes time to understand what is in the data and what is interesting in the data.
For this week we will be exploring data from the NYC Data Transparency Initiative. They maintain a database of complaints that fall under the jurisdiction of the Civilian Complaint Review Board (CCRB), an independent municipal agency. Your objective is to identify interesting patterns and trends within the data that may be indicative of large-scale trends.
library(readxl)
library(ggplot2)
library(ggthemes)
ccrb <- read_excel("Data/ccrb_datatransparencyinitiative.xlsx", sheet = "Complaints_Allegations")
summary(ccrb)
## DateStamp UniqueComplaintId Close Year Received Year
## Min. :2016-11-29 Min. : 1 Min. :2006 Min. :1999
## 1st Qu.:2016-11-29 1st Qu.:17356 1st Qu.:2008 1st Qu.:2007
## Median :2016-11-29 Median :34794 Median :2010 Median :2009
## Mean :2016-11-29 Mean :34778 Mean :2010 Mean :2010
## 3rd Qu.:2016-11-29 3rd Qu.:52204 3rd Qu.:2013 3rd Qu.:2012
## Max. :2016-11-29 Max. :69492 Max. :2016 Max. :2016
## Borough of Occurrence Is Full Investigation Complaint Has Video Evidence
## Length:204397 Mode :logical Mode :logical
## Class :character FALSE:107084 FALSE:195530
## Mode :character TRUE :97313 TRUE :8867
##
##
##
## Complaint Filed Mode Complaint Filed Place
## Length:204397 Length:204397
## Class :character Class :character
## Mode :character Mode :character
##
##
##
## Complaint Contains Stop & Frisk Allegations Incident Location
## Mode :logical Length:204397
## FALSE:119856 Class :character
## TRUE :84541 Mode :character
##
##
##
## Incident Year Encounter Outcome Reason For Initial Contact
## Min. :1999 Length:204397 Length:204397
## 1st Qu.:2007 Class :character Class :character
## Median :2009 Mode :character Mode :character
## Mean :2010
## 3rd Qu.:2012
## Max. :2016
## Allegation FADO Type Allegation Description
## Length:204397 Length:204397
## Class :character Class :character
## Mode :character Mode :character
##
##
##
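– The summary reports 204,397 rows but a maximum UniqueComplaintId of 69,492. This suggests (an assumption, not something confirmed by the workbook itself) that each row is one allegation and that a single complaint can span several rows, so bar heights in the plots below count allegations rather than complaints. A minimal sketch for checking this and for building a one-row-per-complaint table; `count_levels` and `one_row_per_complaint` are hypothetical helper names introduced here:

```r
# Compare the number of rows with the number of distinct complaint ids.
# Assumption: `UniqueComplaintId` identifies a complaint across its allegation rows.
count_levels <- function(df, id_col = "UniqueComplaintId") {
  c(rows = nrow(df),
    complaints = length(unique(df[[id_col]])))
}

# Keep only the first row of each complaint.
# Assumption: complaint-level fields (borough, years, filing mode) repeat on every row.
one_row_per_complaint <- function(df, id_col = "UniqueComplaintId") {
  df[!duplicated(df[[id_col]]), , drop = FALSE]
}

# Runs only when the data frame from read_excel() is already loaded.
if (exists("ccrb")) {
  print(count_levels(ccrb))
  ccrb_complaints <- one_row_per_complaint(ccrb)
}
```

If the two counts differ, plots about complaint-level fields could be drawn from `ccrb_complaints` so that complaints with many allegations are not counted several times.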
ccrb_vis1 <- ggplot(data = ccrb,
                    aes(x = `Incident Year`, fill = `Complaint Filed Mode`)) +
  geom_bar(position = "fill") +
  scale_fill_discrete(name = "Complaint Filed Mode") +
  labs(x = "Incident Year", y = "Proportion of Incidents", title = "Modes of filing complaints, 1999-2016") +
  # coord_cartesian() zooms without dropping the 1999 and 2016 bars, which xlim() would censor
  coord_cartesian(xlim = c(1998.5, 2016.5))
ccrb_vis1
– Methods of filing complaints have changed over time. Fax was the main channel in the early years, while use of the online website has been increasing. However, phone and the call processing system combined accounted for the majority of complaints.
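– Shares read off a filled bar chart are approximate, so it can help to print the same proportions as a table. A minimal sketch in base R; `share_by_group` is a hypothetical helper name, and the column names are those shown in the summary above:

```r
# Row-wise percentage table: for each value of `group_col`,
# the share of rows falling in each category of `fill_col`.
share_by_group <- function(df, group_col, fill_col) {
  counts <- table(df[[group_col]], df[[fill_col]], useNA = "ifany")
  round(100 * prop.table(counts, margin = 1), 1)
}

# Runs only when the data frame from read_excel() is already loaded.
if (exists("ccrb")) {
  print(share_by_group(ccrb, "Incident Year", "Complaint Filed Mode"))
}
```

The same helper can be pointed at any other pair of categorical columns, for example borough and incident location.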
ccrb_vis2 <- ggplot(data = ccrb,
                    aes(x = `Incident Year`, fill = `Borough of Occurrence`)) +
  geom_bar() +
  scale_fill_discrete(name = "Borough of Occurrence") +
  labs(x = "Incident Year", y = "Number of Incidents", title = "No. of incidents in each borough during 1999-2016") +
  # coord_cartesian() zooms without dropping the 1999 and 2016 bars, which xlim() would censor
  coord_cartesian(xlim = c(1998.5, 2016.5))
ccrb_vis2
– Since 2007, the overall number of incidents has been decreasing across boroughs. However, from this graph alone it is hard to tell whether the Bronx or Brooklyn had the most incidents.
ccrb_vis3 <- ggplot(data = ccrb,
                    aes(x = `Borough of Occurrence`)) +
  geom_bar() +
  labs(x = "Borough of Occurrence", y = "Number of Incidents", title = "Incidents by Borough")
ccrb_vis3
– Now we can see that Brooklyn had the most incidents, followed by the Bronx and Manhattan.
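– A ranking like this is easier to read when the categories are sorted by frequency instead of alphabetically. A small sketch, assuming the borough column is a character vector as the summary indicates; `order_by_count` is a hypothetical helper name:

```r
# Turn a character vector into a factor whose levels run from the
# most frequent value to the least frequent one.
order_by_count <- function(x) {
  counts <- sort(table(x), decreasing = TRUE)
  factor(x, levels = names(counts))
}

# Runs only when the data frame from read_excel() is already loaded.
if (exists("ccrb")) {
  borough <- order_by_count(ccrb[["Borough of Occurrence"]])
  print(table(borough))  # counts listed from largest to smallest
}
```

Mapping the reordered factor to the x aesthetic would draw the bars in the same descending order.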
ccrb_vis4 <- ggplot(data = ccrb,
                    aes(x = `Borough of Occurrence`, fill = `Incident Location`)) +
  geom_bar(position = "fill") +
  scale_fill_discrete(name = "Incident Location") +
  labs(x = "Borough of Occurrence", y = "Proportion of Incidents", title = "Proportion of Incident Locations in Each Borough")
ccrb_vis4
– The majority of the alleged events occurred on a street or highway in every borough. Manhattan had a higher share of incidents taking place in a subway station or train than the other boroughs. Also, a larger share of incidents outside NYC happened in an apartment or house.
ccrb_vis5 <- ggplot(data = ccrb,
                    aes(x = `Incident Year`, y = `Received Year`)) +
  geom_point() +
  geom_smooth() +
  labs(x = "Incident Year", y = "Received Year", title = "Relationship between Incident Year and Received Year")
ccrb_vis5
## `geom_smooth()` using method = 'gam'
– Incident Year and Received Year appear to have a linear relationship, meaning that most incidents were reported in the same year they occurred.
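– Because both axes hold whole years, many of the 204,397 rows land on the same point in the scatter plot, so the plot cannot show how many incidents were reported in the year they occurred. A minimal sketch that tabulates the gap in years between two year columns; `year_gap_table` is a hypothetical helper name:

```r
# Count how many rows have each difference (in whole years)
# between `end_col` and `start_col`.
year_gap_table <- function(df, start_col, end_col) {
  gap <- df[[end_col]] - df[[start_col]]
  table(gap, useNA = "ifany")
}

# Runs only when the data frame from read_excel() is already loaded.
if (exists("ccrb")) {
  gaps <- year_gap_table(ccrb, "Incident Year", "Received Year")
  print(gaps)
  print(round(100 * prop.table(gaps), 1))  # same counts as percentages
}
```

A gap of 0 means the incident and the report share a calendar year. With year-level data only, a gap of 1 can still be a delay of a few days around New Year.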
ccrb_vis6 <- ggplot(data = ccrb,
                    aes(x = `Received Year`, y = `Close Year`)) +
  geom_point() +
  geom_smooth() +
  labs(x = "Received Year", y = "Close Year", title = "Relationship between Received Year and Close Year")
ccrb_vis6
## `geom_smooth()` using method = 'gam'
– Received Year and Close Year, however, do not seem to show a clear linear relationship until after 2005. Complaints received before 2005 were not closed within the same year. This partly reflects the data itself: the summary shows no Close Year before 2006, so complaints received earlier appear only if they were closed in 2006 or later. Since 2005 the linear trend forms the baseline, but quite a number of cases still took more than a year to close, which suggests that they needed a longer investigation to reach a conclusion.
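– How closing time changes across received years can also be summarised numerically. A rough sketch that averages the gap between two year columns for each starting year; `mean_gap_by_year` is a hypothetical helper name, and the result is only as precise as the year-level data allows:

```r
# Average difference (in whole years) between `end_col` and `start_col`,
# computed separately for every value of `start_col`.
mean_gap_by_year <- function(df, start_col, end_col) {
  gap <- df[[end_col]] - df[[start_col]]
  round(tapply(gap, df[[start_col]], mean, na.rm = TRUE), 2)
}

# Runs only when the data frame from read_excel() is already loaded.
if (exists("ccrb")) {
  print(mean_gap_by_year(ccrb, "Received Year", "Close Year"))
}
```

Averages for the earliest received years should be read with care, since the summary shows that Close Year starts at 2006.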
ccrb_vis7 <- ggplot(data = ccrb,
                    aes(x = `Borough of Occurrence`, fill = `Encounter Outcome`)) +
  geom_bar(position = "fill") +
  scale_fill_discrete(name = "Encounter Outcome") +
  labs(x = "Borough of Occurrence", y = "Proportion of Incidents", title = "Proportion of Encounter Outcome in Each Borough")
ccrb_vis7
– In the Bronx, Brooklyn and Staten Island, encounters were more likely to end in an arrest than in other boroughs, whereas outside NYC, no arrest or summons made up the majority of encounter outcomes.
ccrb_vis8 <- ggplot(data = ccrb,
                    aes(x = `Encounter Outcome`, fill = `Is Full Investigation`)) +
  geom_bar() +
  scale_fill_discrete(name = "Full Investigation [T/F]") +
  labs(x = "Encounter Outcome", y = "Number of Incidents", title = "Encounter Outcome and Presence of Full Investigation")
ccrb_vis8
– Arrest was the most common encounter outcome, followed by no arrest or summons. For more serious outcomes such as an arrest or a summons, where charges need to be clearly stated, approximately 50% or more of incidents involved a full investigation. For incidents with no arrest or summons, full investigations were less common.
ccrb_vis9 <- ggplot(data = ccrb,
                    aes(x = `Encounter Outcome`, fill = `Complaint Has Video Evidence`)) +
  geom_bar() +
  scale_fill_discrete(name = "Video Evidence [T/F]") +
  labs(x = "Encounter Outcome", y = "Number of Incidents", title = "Encounter Outcome and Presence of Video Evidence")
ccrb_vis9
– Most cases, regardless of encounter outcome, did not have video evidence: the summary shows only 8,867 of 204,397 rows (about 4%) with video evidence.
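– The share of rows with a full investigation or with video evidence can be computed per encounter outcome instead of being estimated from stacked bars. A minimal sketch, relying on the summary's indication that both flag columns are logical; `share_true_by_group` is a hypothetical helper name:

```r
# Percentage of TRUE values in a logical column, per category of `group_col`.
# The mean of a logical vector is the proportion of TRUE values.
share_true_by_group <- function(df, group_col, flag_col) {
  round(100 * tapply(df[[flag_col]], df[[group_col]], mean, na.rm = TRUE), 1)
}

# Runs only when the data frame from read_excel() is already loaded.
if (exists("ccrb")) {
  print(share_true_by_group(ccrb, "Encounter Outcome", "Is Full Investigation"))
  print(share_true_by_group(ccrb, "Encounter Outcome", "Complaint Has Video Evidence"))
}
```

Each printed value is the percentage of rows for that outcome where the flag is TRUE, which puts outcomes with very different row counts on the same scale.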
# Named ccrb_vis10 so that the video-evidence plot stored in ccrb_vis9 is not overwritten
ccrb_vis10 <- ggplot(data = ccrb,
                     aes(x = `Encounter Outcome`, fill = `Allegation FADO Type`)) +
  geom_bar(position = "fill") +
  scale_fill_discrete(name = "Allegation FADO Type") +
  labs(x = "Encounter Outcome", y = "Proportion of Incidents", title = "Encounter Outcome and Allegation FADO Type")
ccrb_vis10
– For incidents that ended in an arrest, force made up the majority of allegation FADO types. However, for cases with no arrest or summons, or with only a summons, abuse of authority was the main type.
– Performing exploratory data analysis (EDA) on the CCRB data lets us analyze several aspects of the complaint allegations. We can see the trend in incidents over time, the boroughs and locations where incidents occurred, and how often evidence and full investigations were present for each type of encounter outcome. All of these could lead to more specific questions that are worth the time and effort of further analysis.
– EDA is a useful approach for data that we are not familiar with. It allows us to explore what is in the data and lets the data tell its story in response to the questions we want answered. Through visualization, we can understand the data, find the answers we are looking for, or raise additional questions for further analysis.