The objective of this assignment is to conduct an exploratory data analysis of a data set that you are not familiar with. In this week's lecture we discussed a number of visualization approaches to exploring a data set; this assignment will apply those tools and techniques. An important distinction between class examples and applied data science work is the iterative and repetitive nature of exploring a data set. It takes time to understand what is in the data and what is interesting in the data.
For this week we will be exploring data from the NYC Data Transparency Initiative. They maintain a database of complaints that fall within the jurisdiction of the Civilian Complaint Review Board (CCRB), an independent municipal agency. Your objective is to identify interesting patterns and trends within the data that may be indicative of large-scale trends.
This link will allow you to download the data set in .xlsx format. The data file has two tabs: one with metadata, and the “Complaints_Allegations” tab with the actual data.
For this assignment, you should submit an RPubs link to a knitr-rendered HTML document that shows your exploratory data analysis. Organize your analysis using meaningful section headings.
Your final document should include at minimum 10 visualizations. Each should include a brief statement of what it shows.
A final section should summarize what you learned from your EDA. Your grade will be based on the quality of your graphics and the sophistication of your findings.
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.3.3
library(ggthemes)
## Warning: package 'ggthemes' was built under R version 3.3.3
library(readxl)
## Warning: package 'readxl' was built under R version 3.3.3
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.3.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
df <- read_excel("D:/Harrisburg/512-50/ccrb_datatransparencyinitiative.xlsx", sheet = "Complaints_Allegations")
df <- data.frame(df)
str(df)
## 'data.frame': 204397 obs. of 16 variables:
## $ DateStamp : POSIXct, format: "2016-11-29" "2016-11-29" ...
## $ UniqueComplaintId : num 11 18 18 18 18 18 18 18 18 18 ...
## $ Close.Year : num 2006 2006 2006 2006 2006 ...
## $ Received.Year : num 2005 2004 2004 2004 2004 ...
## $ Borough.of.Occurrence : chr "Manhattan" "Brooklyn" "Brooklyn" "Brooklyn" ...
## $ Is.Full.Investigation : logi FALSE TRUE TRUE TRUE TRUE TRUE ...
## $ Complaint.Has.Video.Evidence : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ Complaint.Filed.Mode : chr "On-line website" "Phone" "Phone" "Phone" ...
## $ Complaint.Filed.Place : chr "CCRB" "CCRB" "CCRB" "CCRB" ...
## $ Complaint.Contains.Stop...Frisk.Allegations: logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ Incident.Location : chr "Street/highway" "Street/highway" "Street/highway" "Street/highway" ...
## $ Incident.Year : num 2005 2004 2004 2004 2004 ...
## $ Encounter.Outcome : chr "No Arrest or Summons" "Arrest" "Arrest" "Arrest" ...
## $ Reason.For.Initial.Contact : chr "Other" "PD suspected C/V of violation/crime - street" "PD suspected C/V of violation/crime - street" "PD suspected C/V of violation/crime - street" ...
## $ Allegation.FADO.Type : chr "Abuse of Authority" "Abuse of Authority" "Discourtesy" "Discourtesy" ...
## $ Allegation.Description : chr "Threat of arrest" "Refusal to obtain medical treatment" "Word" "Word" ...
df.by.receiveyear <- df %>%
  group_by(Received.Year) %>%
  summarize(num_case = n_distinct(UniqueComplaintId)) %>%
  select(Received.Year, num_case)
ggplot(data = df.by.receiveyear, aes(x = Received.Year, y = num_case)) +
  geom_line(alpha = 0.5) +
  ggtitle('Figure 1: Number of Complaints by Received Year') +
  xlab('Received Year') +
  ylab('Number of Cases') +
  theme_economist()
Figure 1 shows a decreasing trend in the number of complaints received in recent years, with the peak around 2005-2006. This suggests that the complaint situation may be improving.
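The trend read off Figure 1 can also be checked numerically. The following is a minimal sketch, not part of the original analysis: it uses a small made-up table, `toy_by_year` (hypothetical values standing in for `df.by.receiveyear`), to compute the year-over-year change and locate the peak year.

```r
library(dplyr)

# Hypothetical yearly counts; these are NOT the real CCRB numbers
toy_by_year <- data.frame(
  Received.Year = 2004:2009,
  num_case      = c(60, 75, 80, 70, 65, 50)
)

# Year-over-year change: negative values indicate a decline
trend <- toy_by_year %>%
  arrange(Received.Year) %>%
  mutate(change = num_case - dplyr::lag(num_case))

# Year with the highest number of complaints
peak_year <- trend$Received.Year[which.max(trend$num_case)]

print(trend)
print(peak_year)
```

Running the same steps on `df.by.receiveyear` would show whether the decline after the peak is steady or interrupted by occasional increases.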
df.by.closeyear <- df %>%
  group_by(Close.Year) %>%
  summarize(num_case = n_distinct(UniqueComplaintId)) %>%
  select(Close.Year, num_case)
ggplot(data = df.by.closeyear, aes(x = Close.Year, y = num_case)) +
  geom_line(alpha = 0.5) +
  ggtitle('Figure 2: Number of Complaints by Close Year') +
  xlab('Close Year') +
  ylab('Number of Cases') +
  theme_economist()
Figure 2 shows that the number of complaints by close year also follows a general downward trend, although it has fluctuated in recent years.
df.dif <- df %>%
  distinct(UniqueComplaintId, .keep_all = TRUE) %>%
  mutate(time_length = Close.Year - Received.Year)
## Warning: package 'bindrcpp' was built under R version 3.3.3
ggplot(data = df.dif, aes(x = time_length)) +
  geom_bar(width = 0.5, alpha = 0.5) +
  labs(title = 'Figure 3: Time Length for Complaints to Be Processed', x = 'Time Length (Years)') +
  theme_economist()
Figure 3 shows that complaints do not take long to process. Because only years are recorded, a time length of 0 means the case was closed in the same calendar year it was received. The majority of cases are closed in that same year or within 1-2 years.
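To put a number on "the majority of cases", the bar heights can be converted into shares and cumulative shares. This is a minimal sketch under stated assumptions: `toy_dif` is a hypothetical stand-in for `df.dif` (one row per complaint, made-up years), and `closure_share` is a name introduced here for illustration.

```r
library(dplyr)

# Hypothetical complaints, one row each; NOT the real CCRB data
toy_dif <- data.frame(
  UniqueComplaintId = 1:8,
  Received.Year     = c(2005, 2005, 2006, 2006, 2007, 2007, 2008, 2008),
  Close.Year        = c(2005, 2006, 2006, 2007, 2007, 2009, 2008, 2009)
)

closure_share <- toy_dif %>%
  mutate(time_length = Close.Year - Received.Year) %>%
  group_by(time_length) %>%
  summarize(num_case = n()) %>%
  # share of all complaints per time length, then running total
  mutate(share = num_case / sum(num_case),
         cum_share = cumsum(share))

print(closure_share)
```

The `cum_share` column answers questions such as "what fraction of complaints is closed within one calendar year of being received", which is hard to read precisely from a bar chart.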
df.dif2 <- df %>%
  distinct(UniqueComplaintId, .keep_all = TRUE) %>%
  mutate(time_length = Received.Year - Incident.Year) %>%
  select(time_length)
ggplot(data = df.dif2, aes(x = time_length)) +
  geom_bar(width = 0.5, alpha = 0.5) +
  labs(title = 'Figure 4: Time Length for Complaints to Be Received', x = 'Time Length (Years)') +
  theme_economist()
Figure 4 shows that the majority of cases are received in the same year the incident happened. It is a good sign that people tend to report incidents right away.
df.dif3 <- df %>%
  distinct(UniqueComplaintId, .keep_all = TRUE) %>%
  mutate(time_length = Close.Year - Incident.Year) %>%
  select(time_length)
ggplot(data = df.dif3, aes(x = time_length)) +
  geom_bar(width = 0.5, alpha = 0.5) +
  labs(title = 'Figure 5: Time Length for Complaints to Be Closed from Incident Year', x = 'Time Length (Years)') +
  theme_economist()
Figure 5 shows that the majority of cases are also closed in a timely manner relative to the incident year.
ggplot(data = df, aes(x = Borough.of.Occurrence, fill = Is.Full.Investigation)) +
  geom_bar(width = 0.5, alpha = 0.5, stat = 'count') +
  labs(title = 'Figure 6: Geography Location for Complaint and Investigation Situation', x = 'Location') +
  scale_fill_discrete(name = 'Full Investigation or Not') +
  theme_economist()
Figure 6 shows that the share of full investigations does not differ much across boroughs, which is a good sign. However, in every borough there are many complaints that did not trigger a full investigation, so there may still be room for improvement.
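Two caveats apply to this reading. `df` appears to hold one row per allegation (the same `UniqueComplaintId` repeats in the `str()` output), so Figure 6 counts allegations rather than complaints, and raw counts make boroughs of different size hard to compare. The sketch below shows one way to compute a per-complaint full-investigation rate by borough. It uses hypothetical toy data (`toy`) and assumes that borough and investigation status are constant within a complaint.

```r
library(dplyr)

# Hypothetical allegation-level rows; NOT the real CCRB data.
# Complaints 1 and 3 have several allegations each.
toy <- data.frame(
  UniqueComplaintId     = c(1, 1, 2, 3, 3, 3, 4, 5, 6),
  Borough.of.Occurrence = c("Bronx", "Bronx", "Bronx",
                            "Brooklyn", "Brooklyn", "Brooklyn", "Brooklyn",
                            "Queens", "Queens"),
  Is.Full.Investigation = c(TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

rate_by_borough <- toy %>%
  # keep one row per complaint so multi-allegation complaints are not over-counted
  distinct(UniqueComplaintId, .keep_all = TRUE) %>%
  group_by(Borough.of.Occurrence) %>%
  summarize(num_case  = n(),
            full_rate = mean(Is.Full.Investigation))

print(rate_by_borough)
```

In the toy data the Bronx rate is 0.5 per complaint, while counting raw rows would give 2 out of 3, which illustrates how allegation-level counts can shift the picture.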
ggplot(data = df, aes(x = Is.Full.Investigation, fill = Complaint.Has.Video.Evidence)) +
  geom_bar(stat = 'count', alpha = 0.5) +
  labs(title = 'Figure 7: Investigation and Video Evidence Joint Distribution', x = 'Full Investigation or Not') +
  scale_fill_discrete(name = 'Video Evidence or Not') +
  theme_economist()
Figure 7 suggests that the response to complaints is not good. The majority of complaints do not receive a full investigation, whether or not video evidence is included, which points to room for improvement. However, the probability of a full investigation appears to be a little higher when video evidence is included. This suggests that people should provide more evidence when making complaints.
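Comparing probabilities from stacked counts is difficult when one group is much smaller than the other. A proportional bar chart makes the comparison direct. The following is a sketch only, built on hypothetical toy data (`toy_video`) rather than the real file: `position = "fill"` rescales each bar to a total of 1, so the bar segments show the share of full investigations with and without video evidence. The theme is omitted to keep the example dependent on ggplot2 alone.

```r
library(ggplot2)

# Hypothetical complaints, one row each; NOT the real CCRB data
toy_video <- data.frame(
  Complaint.Has.Video.Evidence = c(TRUE, TRUE, TRUE, TRUE,
                                   FALSE, FALSE, FALSE, FALSE, FALSE, FALSE),
  Is.Full.Investigation        = c(TRUE, TRUE, TRUE, FALSE,
                                   TRUE, TRUE, FALSE, FALSE, FALSE, FALSE)
)

# Video evidence on the x axis, investigation status as fill;
# position = "fill" turns counts into within-bar proportions
p <- ggplot(toy_video, aes(x = Complaint.Has.Video.Evidence, fill = Is.Full.Investigation)) +
  geom_bar(position = "fill", alpha = 0.5) +
  labs(x = "Video Evidence or Not", y = "Share of Complaints") +
  scale_fill_discrete(name = "Full Investigation or Not")

p
```

Note that the axes are swapped relative to Figure 7: the question "does video evidence raise the chance of a full investigation" conditions on video evidence, so that variable goes on the x axis.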
ggplot(data = df, aes(x = Complaint.Filed.Mode)) +
  geom_bar(stat = 'count') +
  labs(title = 'Figure 8: Complaint Method Summary', x = 'Method Type') +
  theme_economist()
Figure 8 shows that traditional methods such as fax or mail are not heavily used today. Phone, call, and website filings are more prevalent.
ggplot(data = df, aes(x = Encounter.Outcome)) +
  geom_bar(stat = 'count', alpha = 0.5) +
  labs(title = 'Figure 9: Incident Outcome Summary') +
  theme_economist()
Figure 9 shows that one of the most common outcomes is arrest. However, there are still many cases with no arrest or summons. Further analysis is conducted to examine the relationship between the outcome and the investigation.
ggplot(data = df, aes(x = Encounter.Outcome, fill = Is.Full.Investigation)) +
  geom_bar(stat = 'count', alpha = 0.5) +
  labs(title = 'Figure 10: Incident Outcome and Full Investigation Relationship') +
  scale_fill_discrete(name = 'Full Investigation or Not') +
  theme_economist()
Figure 10 shows that the majority of cases with an arrest outcome went through a full investigation. However, a large number of cases still ended in an arrest, or in no arrest or summons, without a full investigation. There is still room for improvement.