The objective of this assignment is to conduct an exploratory data analysis of a data set that you are not familiar with. In this week’s lecture, we discussed a number of visualization approaches for exploring a data set. This assignment will apply those tools and techniques. An important distinction between class examples and applied data science work is the iterative and repetitive nature of exploring a data set. It takes time to understand what the data is and what is interesting about it (patterns).
For this week, we will be exploring data from the NYC Data Transparency Initiative. They maintain a database of complaints that fall within the jurisdiction of the Civilian Complaint Review Board (CCRB), an independent municipal agency. Your objective is to identify interesting patterns within the data that may be indicative of large-scale trends.
This link will allow you to download the data set in .xlsx format. The data file has two tabs: one with metadata, and the “Complaints_Allegations” tab with the actual data.
For this assignment, you should submit a link to a knitr-rendered HTML document that shows your exploratory data analysis. Organize your analysis using section headings:
library(readxl)
library(ggplot2) # needed for the ggplot() calls below
ccrb <- read_excel("C:/Users/mruga/Desktop/ccrb.xlsx")
summary(ccrb)
## DateStamp UniqueComplaintId Close Year Received Year
## Min. :2016-11-29 Min. : 1 Min. :2006 Min. :1999
## 1st Qu.:2016-11-29 1st Qu.:17356 1st Qu.:2008 1st Qu.:2007
## Median :2016-11-29 Median :34794 Median :2010 Median :2009
## Mean :2016-11-29 Mean :34778 Mean :2010 Mean :2010
## 3rd Qu.:2016-11-29 3rd Qu.:52204 3rd Qu.:2013 3rd Qu.:2012
## Max. :2016-11-29 Max. :69492 Max. :2016 Max. :2016
## Borough of Occurrence Is Full Investigation Complaint Has Video Evidence
## Length:204397 Mode :logical Mode :logical
## Class :character FALSE:107084 FALSE:195530
## Mode :character TRUE :97313 TRUE :8867
##
##
##
## Complaint Filed Mode Complaint Filed Place
## Length:204397 Length:204397
## Class :character Class :character
## Mode :character Mode :character
##
##
##
## Complaint Contains Stop & Frisk Allegations Incident Location Incident Year
## Mode :logical Length:204397 Min. :1999
## FALSE:119856 Class :character 1st Qu.:2007
## TRUE :84541 Mode :character Median :2009
## Mean :2010
## 3rd Qu.:2012
## Max. :2016
## Encounter Outcome Reason For Initial Contact Allegation FADO Type
## Length:204397 Length:204397 Length:204397
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## Allegation Description
## Length:204397
## Class :character
## Mode :character
##
##
##
str(ccrb)
## tibble [204,397 x 16] (S3: tbl_df/tbl/data.frame)
## $ DateStamp : POSIXct[1:204397], format: "2016-11-29" "2016-11-29" ...
## $ UniqueComplaintId : num [1:204397] 11 18 18 18 18 18 18 18 18 18 ...
## $ Close Year : num [1:204397] 2006 2006 2006 2006 2006 ...
## $ Received Year : num [1:204397] 2005 2004 2004 2004 2004 ...
## $ Borough of Occurrence : chr [1:204397] "Manhattan" "Brooklyn" "Brooklyn" "Brooklyn" ...
## $ Is Full Investigation : logi [1:204397] FALSE TRUE TRUE TRUE TRUE TRUE ...
## $ Complaint Has Video Evidence : logi [1:204397] FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ Complaint Filed Mode : chr [1:204397] "On-line website" "Phone" "Phone" "Phone" ...
## $ Complaint Filed Place : chr [1:204397] "CCRB" "CCRB" "CCRB" "CCRB" ...
## $ Complaint Contains Stop & Frisk Allegations: logi [1:204397] FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ Incident Location : chr [1:204397] "Street/highway" "Street/highway" "Street/highway" "Street/highway" ...
## $ Incident Year : num [1:204397] 2005 2004 2004 2004 2004 ...
## $ Encounter Outcome : chr [1:204397] "No Arrest or Summons" "Arrest" "Arrest" "Arrest" ...
## $ Reason For Initial Contact : chr [1:204397] "Other" "PD suspected C/V of violation/crime - street" "PD suspected C/V of violation/crime - street" "PD suspected C/V of violation/crime - street" ...
## $ Allegation FADO Type : chr [1:204397] "Abuse of Authority" "Abuse of Authority" "Discourtesy" "Discourtesy" ...
## $ Allegation Description : chr [1:204397] "Threat of arrest" "Refusal to obtain medical treatment" "Word" "Word" ...
dim(ccrb)
## [1] 204397 16
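In the `str()` output above, `UniqueComplaintId` repeats across rows (18 appears several times with different allegation descriptions), which suggests that each row is one allegation rather than one complaint. The following is a minimal sketch, assuming that reading is correct, of how to compare the row count with the number of distinct complaints. The `toy` data frame is a made-up stand-in for `ccrb`; only its column names come from the real data.

```r
# Toy stand-in for ccrb (hypothetical values; only the column names match the real data)
toy <- data.frame(
  UniqueComplaintId = c(11, 18, 18, 18, 25),
  `Received Year`   = c(2005, 2004, 2004, 2004, 2006),
  check.names = FALSE
)

n_rows       <- nrow(toy)                              # allegation-level rows
n_complaints <- length(unique(toy$UniqueComplaintId))  # distinct complaints

# Keep one row per complaint before counting complaints per year
complaints <- toy[!duplicated(toy$UniqueComplaintId), ]
table(complaints$`Received Year`)
```

If the same pattern holds in `ccrb`, plots built directly on the full table count allegations, so a year with a few complaints carrying many allegations each would look busier than it was.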
ggplot(ccrb, aes(x=`Received Year`, fill= `Allegation FADO Type`)) +
geom_bar() +
labs (title = "Number of Complaints Received Each Year", x="Received Year", y="Number of Complaints") +
scale_fill_discrete(name = "Allegation Type")
borough = table(ccrb$`Borough of Occurrence`)
lbls = names(borough)
barplot(borough,
xlab = "Borough of Occurrence",
ylab = "Number",
main = "Borough of Occurrence in CCRB Report",
horiz = FALSE,
legend.text = TRUE,
col=rainbow(length(lbls)))
ggplot(ccrb, aes(`Complaint Filed Mode`, fill = `Complaint Filed Mode`)) +
geom_bar() +
labs(title = "Mode of Complaints Filed", x = "Mode", y = "Total") +
scale_fill_discrete(name = "Complaint Filed Mode")
ggplot(ccrb,aes(x = `Received Year`, fill = `Complaint Has Video Evidence`)) +
geom_bar() +
labs(title = "Number of Complaints With Video Evidence by Year")
scatter = ggplot(ccrb, aes(`Received Year`, `Close Year`))
scatter + geom_point() +
geom_smooth(method = 'lm', color = 'red') +
xlab('Complaints Received Year') +
ylab('Complaints Closed Year') +
ggtitle('Complaints Receiving and Closing Year')
ggplot(ccrb, aes(`Incident Location`)) +
geom_bar() +
labs(title="Number of Complaints by Incident Location", x="Incident Location", y="Total") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
ggplot(ccrb,aes(x = `Encounter Outcome`,
fill = `Borough of Occurrence`)) +
geom_bar(stat = "count") +
labs(title = "Encounter Outcomes by borough")
ggplot(ccrb, aes(x=`Received Year`, fill= `Encounter Outcome`)) +
geom_bar() +
labs(title = "Outcomes Each Year", x = "Received Year", y = "Number of Complaints") +
scale_fill_discrete(name = "Outcome Type")
ggplot(ccrb, aes(x = `Is Full Investigation`, fill = `Encounter Outcome`)) + # refer to the column directly; avoid ccrb$ inside aes()
geom_bar(stat = 'count') +
labs(title='Investigation vs Outcome') +
scale_fill_discrete(name = 'Encounter Outcome')
encounter = table(ccrb$`Encounter Outcome`)
lbls <- names(encounter)
percent <- round(encounter/sum(encounter)*100)
lbls <- paste(lbls, percent) # add percents to labels
lbls <- paste(lbls,"%",sep="") # add % to labels
pie(encounter, labels = lbls, main = "Encounter Outcome by %") # pie() draws the chart and returns NULL, so no assignment is needed
We are given the Civilian Complaint Review Board (CCRB) data for NYC. A quick glance shows that there are 204,397 records and 16 variables. Incidents are recorded across 7 borough categories, including NA. The exploratory data analysis revealed a few things: 1) Brooklyn sees the most incidents. 2) Phone is by far the most common mode for filing complaints. 3) 44% of the incidents end in arrest, which is higher than any other outcome. 4) Fully investigated complaints show a higher rate of arrest. 5) Streets/highways are the most common incident location.
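The percentage claims above can be checked numerically rather than read off the charts. The following is a minimal sketch using base R's `table()` and `prop.table()`. The `toy` data frame is a hypothetical stand-in for `ccrb` with invented values, so its numbers say nothing about the real data.

```r
# Toy stand-in for ccrb (hypothetical values; only the column names match the real data)
toy <- data.frame(
  `Encounter Outcome`     = c("Arrest", "Arrest", "Summons",
                              "No Arrest or Summons", "Arrest",
                              "No Arrest or Summons"),
  `Is Full Investigation` = c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE),
  check.names = FALSE
)

# Overall share of each encounter outcome, in percent
outcome_pct <- round(100 * prop.table(table(toy$`Encounter Outcome`)), 1)

# Share of each outcome within each investigation status (each row sums to 100)
by_investigation <- round(100 * prop.table(
  table(toy$`Is Full Investigation`, toy$`Encounter Outcome`),
  margin = 1
), 1)

outcome_pct
by_investigation
```

Replacing `toy` with `ccrb` gives the overall arrest share for finding 3, and the row-wise table compares arrest rates between full and partial investigations for finding 4.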