The objective of this assignment is to conduct an exploratory data analysis of a data set that you are not familiar with. In this week's lecture we discussed a number of visualization approaches to exploring a data set; this assignment will apply those tools and techniques. An important distinction between class examples and applied data science work is the iterative and repetitive nature of exploring a data set. It takes time to understand what is in the data and what is interesting in the data.
For this week we will be exploring data from the NYC Data Transparency Initiative. They maintain a database of complaints that fall within the jurisdiction of the Civilian Complaint Review Board (CCRB), an independent municipal agency. Your objective is to identify interesting patterns and trends within the data that may be indicative of large-scale trends.
This link will allow you to download the data set in .xlsx format. The data file has two tabs: one with metadata, and the “Complaints_Allegations” tab with the actual data.
For this assignment you should submit a link to a knitr-rendered HTML document that shows your exploratory data analysis. Organize your analysis using section headings:
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.3.3
library(ggthemes)
## Warning: package 'ggthemes' was built under R version 3.3.3
ccrb <- read.csv("G:\\ccrb.csv")
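The assignment describes an .xlsx file, while the code above reads a CSV, so the data tab was presumably exported to CSV first. The sketch below shows one way to check what `read.csv` produced before plotting. It uses a small invented file, not the CCRB data, and assumes the original headers contain spaces, which would explain the dotted column names used in the plots.

```r
# Toy stand-in for the exported CSV; the rows are invented for illustration.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Received Year,Complaint Filed Mode",
  "2006,Phone",
  "2007,Mail"
), csv_path)

# read.csv() uses check.names = TRUE by default, which turns spaces
# in header names into dots (e.g. "Received Year" -> "Received.Year").
toy <- read.csv(csv_path)

# Inspect column names and types before building any plot.
print(names(toy))
str(toy)
```

Running `str(ccrb)` on the real data in the same way shows which columns were read as numbers and which as text or factors.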
To get a sense of how many complaints were received each year, we use the bar chart below.
ggplot(ccrb, aes(x=Received.Year)) + geom_bar(stat = "count") + labs(title="Complaints Received Each Year", x="Received Year", y="Number of Complaints")
To see how the different types of complaints rank, we use the bar chart below.
ggplot(ccrb, aes(x=Allegation.FADO.Type, fill=Allegation.FADO.Type)) + geom_bar(stat = "count") + labs(title="Number of Complaints by Allegation Type", x="Type", y="Number") + theme(legend.position = "bottom") + scale_fill_discrete(name="Type")
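A bar chart shows the ranking visually, but the exact counts and shares are easier to read from a table. The following is a minimal sketch using base R on a toy data frame. The values are made up for illustration and only the column name `Allegation.FADO.Type` comes from the analysis above.

```r
# Toy data: values are invented, not taken from the CCRB file.
toy <- data.frame(
  Allegation.FADO.Type = c("Force", "Abuse of Authority", "Discourtesy",
                           "Abuse of Authority", "Force", "Abuse of Authority"),
  stringsAsFactors = FALSE
)

# Count rows per category, then sort from most to least frequent.
type_counts <- sort(table(toy$Allegation.FADO.Type), decreasing = TRUE)
print(type_counts)

# Share of all rows that falls in each category.
type_share <- prop.table(type_counts)
print(round(type_share, 2))
```

Replacing `toy` with `ccrb` should give the same ranking the chart displays, assuming the column has no unexpected blank or misspelled categories.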
To see the different modes used to file complaints, we use the bar chart below.
ggplot(ccrb, aes(x=Complaint.Filed.Mode, fill=Complaint.Filed.Mode)) + geom_bar(stat = "count") + labs(title="Number of Complaints by Filed Mode", x="Mode", y="Number") + theme(legend.position = "bottom") + scale_fill_discrete(name="Mode")
To compare the incident year and the close year of the complaints, we use the graph below.
ggplot(ccrb, aes(x=Incident.Year, y=Close.Year)) + geom_point() + geom_smooth(method = lm) + labs(title="Incident Year vs Close Year", x="Incident Year", y="Close Year")
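Because both axes are whole years, many points in the scatter plot sit on top of each other, so it is hard to read how long cases typically stay open. One way to quantify this is to compute the lag between the two years directly. The sketch below assumes both columns are numeric and uses invented values in place of the real data.

```r
# Toy data: years are invented for illustration.
toy <- data.frame(
  Incident.Year = c(2005, 2005, 2006, 2006, 2007, 2008),
  Close.Year    = c(2005, 2006, 2006, 2008, 2008, 2009)
)

# Years elapsed between the incident and the case being closed.
lag_years <- toy$Close.Year - toy$Incident.Year

# Distribution of the lag across cases.
lag_table <- table(lag_years)
print(lag_table)

# Share of cases closed in the same year the incident occurred.
same_year_share <- mean(lag_years == 0)
print(same_year_share)
```

If the real columns contain missing years, `mean(lag_years == 0, na.rm = TRUE)` would be needed to avoid an `NA` result.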
To see the outcomes of the complaints, we use the graph below.
ggplot(ccrb, aes(x=Incident.Year, fill=Encounter.Outcome)) + geom_bar() + labs(title="Number of Incidents Each Year by Outcome", x="Incident Year", y="Number") + scale_fill_discrete(name="Outcome")
To see how complaints were filed in each borough, we use the graph below.
ggplot(ccrb, aes(x=Borough.of.Occurrence, fill=Complaint.Filed.Mode)) +
geom_bar(stat = "count") + labs(title="Borough of Occurrence by Filed Mode", x="Borough of Occurrence", y="Number") + scale_fill_discrete(name="Complaint Filed Mode")
To see how many cases were closed each year with each outcome, we use the graph below.
ggplot(ccrb, aes(x=Close.Year, fill=Encounter.Outcome)) + geom_bar() + labs(title="Number of Incidents Closed Each Year by Outcome", x="Close Year", y="Number") + scale_fill_discrete(name="Encounter Outcome")
To see how many cases were fully investigated, by incident year, we use the graph below.
ggplot(ccrb, aes(x=Incident.Year, fill=Is.Full.Investigation)) + geom_bar(stat = "count") + labs(title="Complaints with Full Investigation", x="Incident Year", y="Number") + scale_fill_discrete(name="Fully Investigated")
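Stacked counts make it hard to compare the share of fully investigated cases across years, because the bar heights differ. A row-wise proportion table is one way to check this. The sketch below uses invented values and assumes `Is.Full.Investigation` is read as a logical column; in the real file it may be text or a factor instead.

```r
# Toy data: values are invented for illustration.
toy <- data.frame(
  Incident.Year = c(2005, 2005, 2006, 2006, 2006, 2007, 2007, 2007, 2007),
  Is.Full.Investigation = c(TRUE, TRUE, TRUE, TRUE, FALSE,
                            TRUE, FALSE, FALSE, FALSE)
)

# Cross-tabulate year against investigation status.
counts <- table(toy$Incident.Year, toy$Is.Full.Investigation)

# margin = 1 divides each row by its own total, giving per-year shares.
row_share <- prop.table(counts, margin = 1)
print(round(row_share, 2))
```

In ggplot2, `geom_bar(position = "fill")` draws the same per-year shares as bars of equal height.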
To see how many cases have video evidence, we use the graph below.
ggplot(ccrb, aes(x=Incident.Year, fill=Complaint.Has.Video.Evidence)) + geom_bar(stat = "count") + labs(title="Complaints with Video Evidence", x="Incident Year", y="Number") + scale_fill_discrete(name="Has Video Evidence")
To see how many of the cases closed each year had been fully investigated, we use the graph below.
ggplot(ccrb, aes(x=Close.Year, fill=Is.Full.Investigation)) + geom_bar() + labs(title="No. of Cases Closed Each Year by Investigation", x="Close Year", y="Number") + scale_fill_discrete(name="Fully Investigated")
In this EDA, I have learned that:
The number of complaints filed rose and then fell. It peaked in 2007, the second year after filing began, and decreased year by year after that.
By number of complaints, the allegation types ranked Abuse of Authority, Force, Discourtesy, and Offensive Language. However, this does not mean that Offensive Language occurred least often, because victims of offensive language may simply be the least likely to file a complaint.
Most complaints were filed by phone or through the calling system.
Fewer than 20% of the cases were closed in the same year the incident occurred. Most cases were closed within 5 years.
Over one third of the cases resulted in an arrest. About one third resulted in neither an arrest nor a summons.
Brooklyn had the most filed complaints. It also had the widest variety of filing modes.
The number of fully investigated cases peaked in 2006. In 2005 and 2006, most cases were fully investigated. As the number of filed complaints increased, the percentage of cases that were fully investigated actually decreased.
Most complaints filed did not have video evidence.