The objective of this assignment is to conduct an exploratory data analysis of a data set that you are not familiar with. In this week's lecture we discussed a number of visualization approaches to exploring a data set; this assignment will apply those tools and techniques. An important distinction between class examples and applied data science work is the iterative and repetitive nature of exploring a data set. It takes time to understand what is in the data and what is interesting in it.
For this week we will be exploring data from the NYC Data Transparency Initiative. They maintain a database of complaints that fall within the jurisdiction of the Civilian Complaint Review Board (CCRB), an independent municipal agency. Your objective is to identify interesting patterns and trends within the data that may be indicative of large-scale trends.
This link will allow you to download the data set in .xlsx format. The data file has two tabs: one with metadata, and the “Complaints_Allegations” tab with the actual data.
For this assignment you should submit a link to a knitr rendered html document that shows your exploratory data analysis. Organize your analysis using section headings:
Your final document should include at least 10 visualizations. Each should include a brief statement of why you made the graphic.
A final section should summarize what you learned from your EDA. Your grade will be based on the quality of your graphics and the sophistication of your findings.
library(ggplot2)
data_CCRB <- read.csv(file="C:/Users/Calmth of Life/Dropbox/Harrisburg Semesters/ANLY 512/Problem Set 4/ccrb_datatransparencyinitiative.csv")
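Before plotting, it can help to confirm the column types and the number of missing values, since both affect how the bar charts group the data. The following is a minimal sketch in base R, not part of the original analysis. It uses a small made-up data frame named toy as a stand-in for data_CCRB; the values are invented, and only the column names are taken from the plots in this report.

```r
# Toy stand-in for data_CCRB; the values are invented for illustration only.
toy <- data.frame(
  Received.Year        = c(2014, 2015, 2015, NA),
  Allegation.FADO.Type = c("Force", "Discourtesy", NA, "Force"),
  stringsAsFactors     = FALSE
)

# Column types and a preview of the first values in each column.
str(toy)

# Number of missing values in each column.
na_counts <- colSums(is.na(toy))
print(na_counts)
```

Running the same two calls on data_CCRB would show whether the year columns were read as numbers and whether any column has enough missing values to distort the counts.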
This bar chart shows the number of complaints received each year.
ggplot(data_CCRB, aes(x=Received.Year)) + geom_bar() + labs(title="Complaints Received Each Year", x="Received Year", y="Number of Complaints")
This bar chart shows the number of complaints for each allegation type.
ggplot(data_CCRB, aes(x=Allegation.FADO.Type, fill=Allegation.FADO.Type)) + geom_bar() + labs(title="Number of Complaints by Allegation Type", x="Type", y="Number") + theme(legend.position = "bottom") + scale_fill_discrete(name="Type")
This stacked bar chart shows the number of cases closed each year, split by whether they were fully investigated.
ggplot(data_CCRB, aes(x=Close.Year, fill=Is.Full.Investigation)) + geom_bar() + labs(title="No. of Cases Closed Each Year by Investigation", x="Close Year", y="Number") + scale_fill_discrete(name="Fully Investigated")
This stacked bar chart shows the number of cases closed each year, broken down by encounter outcome.
ggplot(data_CCRB, aes(x=Close.Year, fill=Encounter.Outcome)) + geom_bar() + labs(title="Number of Cases Closed Each Year by Outcome", x="Close Year", y="Number") + scale_fill_discrete(name="Encounter Outcome")
This stacked bar chart shows the number of complaints by incident year, split by whether they were fully investigated.
ggplot(data_CCRB, aes(x=Incident.Year, fill=Is.Full.Investigation)) + geom_bar() + labs(title="Complaints by Full Investigation Status", x="Incident Year", y="Number") + scale_fill_discrete(name="Fully Investigated")
This stacked bar chart shows the number of incidents that occurred each year, broken down by encounter outcome.
ggplot(data_CCRB, aes(x=Incident.Year, fill=Encounter.Outcome)) + geom_bar() + labs(title="Number of Incidents Each Year by Outcome", x="Incident Year", y="Number") + scale_fill_discrete(name="Outcome")
This stacked bar chart shows the number of complaints by incident year, split by whether they have video evidence.
ggplot(data_CCRB, aes(x=Incident.Year, fill=Complaint.Has.Video.Evidence)) + geom_bar(stat = "count") + labs(title="Complaints with Video Evidence", x="Incident Year", y="Number") + scale_fill_discrete(name="Has Video Evidence")
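Stacked counts make it hard to tell whether the share of complaints with video evidence is changing, because the total number of complaints also changes from year to year. The sketch below computes the yearly share directly in base R. It is an illustration only: toy is a made-up data frame standing in for data_CCRB, and it assumes Complaint.Has.Video.Evidence is stored as TRUE/FALSE, which may not match the real file.

```r
# Toy stand-in for data_CCRB; the values are invented for illustration only.
toy <- data.frame(
  Incident.Year               = c(2012, 2012, 2012, 2012, 2013, 2013),
  Complaint.Has.Video.Evidence = c(FALSE, FALSE, FALSE, TRUE, FALSE, TRUE)
)

# Cross-tabulate year against the video flag, then convert each row
# (each year) into proportions that sum to 1.
counts <- table(toy$Incident.Year, toy$Complaint.Has.Video.Evidence)
shares <- prop.table(counts, margin = 1)

# Share of complaints with video evidence in each year.
share_video <- shares[, "TRUE"]
print(share_video)
```

In ggplot2, passing position = "fill" to geom_bar() draws the same within-year proportions instead of raw counts.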
This scatter plot compares the year in which each incident happened with the year in which the case was closed, with a fitted linear trend line.
ggplot(data_CCRB, aes(x=Incident.Year, y=Close.Year)) + geom_point() + geom_smooth(method = lm) + labs(title="Incident Year vs Close Year", x="Incident Year", y="Close Year")
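Because both axes hold whole years, many points in this scatter plot sit exactly on top of each other, so the plot cannot show how common each delay is. One way to make the delay explicit is to subtract the two columns and tabulate the result. This is a minimal sketch on made-up data, not a result from the CCRB file: toy and the derived column Years.To.Close are hypothetical names introduced here.

```r
# Toy stand-in for data_CCRB; the values are invented for illustration only.
toy <- data.frame(
  Incident.Year = c(2010, 2010, 2011, 2012, 2012, 2012),
  Close.Year    = c(2010, 2011, 2011, 2012, 2013, 2014)
)

# Whole years between the incident and the closing of the case.
toy$Years.To.Close <- toy$Close.Year - toy$Incident.Year

# How many cases fall into each delay, and the average delay in years.
lag_counts <- table(toy$Years.To.Close)
mean_lag   <- mean(toy$Years.To.Close)

print(lag_counts)
print(mean_lag)
```

Applied to data_CCRB, the same subtraction would yield a column that can be drawn as a bar chart of delays, which avoids the overplotting in the scatter plot.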
This stacked bar chart shows how complaints were filed in each borough.
ggplot(data_CCRB, aes(x=Borough.of.Occurrence, fill=Complaint.Filed.Mode)) + geom_bar(stat = "count") + labs(title="Borough of Occurrence by Filed Mode", x="Borough of Occurrence", y="Number") + scale_fill_discrete(name="Complaint Filed Mode")
This bar chart shows the number of complaints filed through each mode.
ggplot(data_CCRB, aes(x=Complaint.Filed.Mode, fill=Complaint.Filed.Mode)) + geom_bar() + labs(title="Number of Complaints by Filed Mode", x="Mode", y="Number") + theme(legend.position = "bottom") + scale_fill_discrete(name="Mode")
Exploratory data analysis is very helpful in understanding the distribution and trends of the underlying data. Using visualization techniques on the CCRB data, we can easily see patterns that would otherwise stay hidden in a sea of data. Comparing the number of cases filed in a year with the number closed in that year gives us an idea of how long, on average, it takes to resolve a complaint. We can also explore which years or locations received more or fewer complaints, and then ask whether a higher number of complaints reflects more crime or stricter policing. This exercise was very helpful for learning both data exploration and R.