The objective of this assignment is to conduct an exploratory data analysis of a data set that you are not familiar with. In this week's lecture we discussed a number of visualization approaches to exploring a data set; this assignment will apply those tools and techniques. An important distinction between class examples and applied data science work is the iterative and repetitive nature of exploring a data set. It takes time to understand what is in the data and what is interesting in the data.
For this week we will be exploring data from the NYC Data Transparency Initiative. They maintain a database of complaints that fall within the jurisdiction of the Civilian Complaint Review Board (CCRB), an independent municipal agency. Your objective is to identify interesting patterns and trends within the data that may be indicative of large-scale trends.
This link will allow you to download the data set in .xlsx format. The data file has two tabs: one with metadata, and the “Complaints_Allegations” tab with the actual data.
For this assignment you should submit an RPubs link to a knitr-rendered HTML document that shows your exploratory data analysis. Organize your analysis using meaningful section headings:
# This is a top section
## This is a subsection
ccrb <- read.csv("/Users/rohitmishra/Desktop/Harrisburg University/ANLY 512/ccrb.csv")
head(ccrb)
## DateStamp UniqueComplaintId Close.Year Received.Year
## 1 11/29/2016 11 2006 2005
## 2 11/29/2016 18 2006 2004
## 3 11/29/2016 18 2006 2004
## 4 11/29/2016 18 2006 2004
## 5 11/29/2016 18 2006 2004
## 6 11/29/2016 18 2006 2004
## Borough.of.Occurrence Is.Full.Investigation Complaint.Has.Video.Evidence
## 1 Manhattan FALSE FALSE
## 2 Brooklyn TRUE FALSE
## 3 Brooklyn TRUE FALSE
## 4 Brooklyn TRUE FALSE
## 5 Brooklyn TRUE FALSE
## 6 Brooklyn TRUE FALSE
## Complaint.Filed.Mode Complaint.Filed.Place
## 1 On-line website CCRB
## 2 Phone CCRB
## 3 Phone CCRB
## 4 Phone CCRB
## 5 Phone CCRB
## 6 Phone CCRB
## Complaint.Contains.Stop...Frisk.Allegations Incident.Location
## 1 FALSE Street/highway
## 2 FALSE Street/highway
## 3 FALSE Street/highway
## 4 FALSE Street/highway
## 5 FALSE Street/highway
## 6 FALSE Street/highway
## Incident.Year Encounter.Outcome
## 1 2005 No Arrest or Summons
## 2 2004 Arrest
## 3 2004 Arrest
## 4 2004 Arrest
## 5 2004 Arrest
## 6 2004 Arrest
## Reason.For.Initial.Contact Allegation.FADO.Type
## 1 Other Abuse of Authority
## 2 PD suspected C/V of violation/crime - street Abuse of Authority
## 3 PD suspected C/V of violation/crime - street Discourtesy
## 4 PD suspected C/V of violation/crime - street Discourtesy
## 5 PD suspected C/V of violation/crime - street Discourtesy
## 6 PD suspected C/V of violation/crime - street Force
## Allegation.Description
## 1 Threat of arrest
## 2 Refusal to obtain medical treatment
## 3 Word
## 4 Word
## 5 Word
## 6 Physical force
ccrb <- unique(ccrb)
Of all the boroughs, Brooklyn has the largest number of incidents filed with the CCRB.
library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Warning: package 'tidyr' was built under R version 3.4.1
## Warning: package 'purrr' was built under R version 3.4.1
## Warning: package 'dplyr' was built under R version 3.4.1
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag(): dplyr, stats
library(ggplot2)
library(forcats)
library(dplyr)
ggplot(ccrb, aes(x = fct_infreq(Borough.of.Occurrence))) +
geom_bar() + xlab("Borough.of.Occurrence")
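The bar chart above counts rows, and the `head(ccrb)` output suggests that each row is an allegation, so a complaint with several allegations is counted more than once. As a cross-check, the sketch below counts distinct complaints per borough instead. The helper name `complaints_by_borough` and the `toy` rows are hypothetical and made up for illustration; only the column names come from the data set.

```r
# Hypothetical helper: count distinct complaints per borough, largest first.
complaints_by_borough <- function(df) {
  # Keep one row per complaint so multi-allegation complaints count once
  per_complaint <- df[!duplicated(df$UniqueComplaintId), ]
  sort(table(per_complaint$Borough.of.Occurrence), decreasing = TRUE)
}

# Toy rows, invented for illustration only (not taken from the CCRB file)
toy <- data.frame(
  UniqueComplaintId = c(1, 2, 2, 3, 4),
  Borough.of.Occurrence = c("Manhattan", "Brooklyn", "Brooklyn", "Brooklyn", "Queens"),
  stringsAsFactors = FALSE
)
toy_counts <- complaints_by_borough(toy)
print(toy_counts)

# On the real data this would be: complaints_by_borough(ccrb)
```

If the ranking of boroughs is the same for rows and for distinct complaints, the conclusion does not depend on how many allegations each complaint carries.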
The number of complaints received has declined in recent years. It peaked around 2005-2006 and has been decreasing since.
library(ggplot2)
library(ggthemes)
library(readxl)
library(dplyr)
df.by.receiveyear <- ccrb %>%
group_by(Received.Year) %>%
summarize(num_case = n_distinct(UniqueComplaintId)) %>%
select(Received.Year, num_case)
ggplot(data = df.by.receiveyear, aes(x = Received.Year, y = num_case)) +
geom_line(alpha = 0.5) +
ggtitle('Figure 1: Number of Complaints by Received Year') +
xlab('Received Year') +
ylab('Number of Cases') +
theme_economist()
The filing mode shows that most people prefer to register complaints by phone.
ccrb %>% mutate(Complaint.Filed.Mode = Complaint.Filed.Mode %>% fct_infreq() %>% fct_rev()) %>%
ggplot(aes(Complaint.Filed.Mode)) +
geom_bar() + coord_flip() + xlab("Complaint.Filed.Mode")
We see that most encounters end without an arrest (the data set labels this outcome "No Arrest or Summons").
ggplot(ccrb, aes(x = fct_infreq(Encounter.Outcome))) +
geom_bar() + xlab("Encounter.Outcome")
Complaint processing does not take very long. The majority of cases are closed within a year, or within one to two years, of being received.
df.dif <- ccrb %>%
distinct(UniqueComplaintId, .keep_all = TRUE) %>%
mutate(time_length = Close.Year - Received.Year)
ggplot(data = df.dif, aes(x = time_length)) +
geom_bar(width = 0.5, alpha = 0.5) +
labs(title = 'Figure 3: Time Length for Complaints to Be Processed', x = 'Time Length (Years)') +
theme_economist()
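To put numbers on the bar chart above, the shares can be tabulated directly. This is a minimal sketch under stated assumptions: the helper name `closure_time_share` and the `toy` rows are hypothetical, and only the column names come from the data set. Because only years are recorded, a difference of 0 means "closed in the same calendar year", not necessarily "closed quickly".

```r
# Hypothetical helper: share of complaints by number of years from receipt to closure.
closure_time_share <- function(df) {
  # One row per complaint, mirroring the distinct() step used for df.dif
  per_complaint <- df[!duplicated(df$UniqueComplaintId), ]
  years_open <- per_complaint$Close.Year - per_complaint$Received.Year
  round(prop.table(table(years_open)), 3)
}

# Toy rows, invented for illustration only (not taken from the CCRB file)
toy <- data.frame(
  UniqueComplaintId = c(1, 2, 2, 3, 4),
  Received.Year = c(2005, 2004, 2004, 2010, 2011),
  Close.Year = c(2005, 2006, 2006, 2011, 2011)
)
toy_share <- closure_time_share(toy)
print(toy_share)

# On the real data this would be: closure_time_share(ccrb)
```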
Incident Outcome: Most complaints in which the encounter ended in an arrest received a full investigation. However, a large number of complaints, both those involving an arrest and those involving no arrest or summons, did not receive a full investigation.
ggplot(data = ccrb , aes(x = Encounter.Outcome, fill = Is.Full.Investigation)) +
geom_bar(stat = 'count', alpha = 0.5) +
labs(title='Figure 10: Incident Outcome and Full Investigation Relationship') +
scale_fill_discrete(name = 'Full Investigation or Not') +
theme_economist()
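Stacked counts make it hard to compare investigation rates across outcomes of very different size. The sketch below computes the share of complaints with a full investigation within each encounter outcome. The helper name `full_investigation_rate` and the `toy` rows are hypothetical; the sketch also assumes `Is.Full.Investigation` is read in as a logical column, as the `head(ccrb)` output suggests.

```r
# Hypothetical helper: proportion of complaints with a full investigation,
# computed separately for each encounter outcome.
full_investigation_rate <- function(df) {
  # One row per complaint so multi-allegation complaints count once
  per_complaint <- df[!duplicated(df$UniqueComplaintId), ]
  # The mean of a logical vector is the proportion of TRUE values
  tapply(per_complaint$Is.Full.Investigation, per_complaint$Encounter.Outcome, mean)
}

# Toy rows, invented for illustration only (not taken from the CCRB file)
toy <- data.frame(
  UniqueComplaintId = c(1, 2, 2, 3, 4, 5),
  Encounter.Outcome = c("Arrest", "Arrest", "Arrest", "Arrest",
                        "No Arrest or Summons", "No Arrest or Summons"),
  Is.Full.Investigation = c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE),
  stringsAsFactors = FALSE
)
toy_rate <- full_investigation_rate(toy)
print(toy_rate)

# On the real data this would be: full_investigation_rate(ccrb)
```

A similar view is available in ggplot2 by passing `position = "fill"` to `geom_bar()`, which rescales each bar to proportions.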
# Ten most frequent reasons for initial contact
order <- data.frame(sort(table(ccrb$Reason.For.Initial.Contact), decreasing = TRUE))
ggplot(order[1:10, ], aes(Var1, Freq)) + geom_point() + coord_flip()
# Visualization 8: When each allegation type is split by whether a full investigation was done, the proportions of TRUE vs. FALSE are quite close
ggplot(data = ccrb, aes(x = Allegation.FADO.Type, fill = Is.Full.Investigation)) +
geom_bar(stat = 'count') +
labs(title='Allegation Type vs. Full Investigation Relationship') +
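scale_fill_discrete(name = 'Full Investigation or Not') +
# Stack the labels the same way as the bars so each count sits inside its own segment
geom_text(stat = 'count', aes(label = ..count..), position = position_stack(vjust = 0.5))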
Incident Location: When the incident locations are arranged in descending order of frequency, the most common location for incidents is the street/highway.
ggplot(ccrb, aes(x = fct_infreq(Incident.Location))) +
geom_bar() + xlab("Incident.Location") + coord_flip()
#Visualization 10:
library(vcd)
## Loading required package: grid
ccrb1 <- na.omit(ccrb)
x <- ccrb1[, c("Encounter.Outcome", "Allegation.FADO.Type")]  # columns 13 and 15, selected by name
assoc(~ Allegation.FADO.Type + Encounter.Outcome, data = x, shade = TRUE)
Exploratory data analysis (EDA) is important for summarizing the basic relationships among variables. It helps identify which questions are already answered and which still need to be dug into. For instance, in this data set, EDA suggests that complaints are processed fairly quickly, because the number of years between receiving and closing a case is low. However, the investigation visualizations imply that there is room to analyze why the full-investigation rate is low.