Objectives

The objective of this assignment is to conduct an exploratory data analysis of a data set that you are not familiar with. In this week’s lecture we discussed a number of visualization approaches to exploring a data set; this assignment will apply those tools and techniques. An important distinction between class examples and applied data science work is the iterative and repetitive nature of exploring a data set. It takes time to understand what is in the data and what is interesting in it.

This week we will be exploring data from the NYC Data Transparency Initiative. They maintain a database of complaints that fall within the jurisdiction of the Civilian Complaint Review Board (CCRB), an independent municipal agency. Your objective is to identify interesting patterns and trends within the data that may be indicative of large-scale trends.

This link will allow you to download the data set in .xlsx format. The data file has two tabs: one with metadata, and the “Complaints_Allegations” tab with the actual data.

Deliverable and Grades

For this assignment you should submit a link to a knitr rendered html document that shows your exploratory data analysis. Organize your analysis using section headings:

# This is a top section

## This is a subsection

Your final document should include at least 10 visualizations. Each should include a brief statement of why you made the graphic.

A final section should summarize what you learned from your EDA. Your grade will be based on the quality of your graphics and the sophistication of your findings.

Swap the positions of the two sheets in the original Excel file so that the data sheet comes first, then read it into R.

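library(ggplot2)
library(readxl)
# read_excel() reads the first sheet by default, which is why the data sheet must come first
ccrb_data <- read_excel("/Users/KD/Dropbox/HU/ANLY 512/Assignment/ccrb_datatransparencyinitiative.xlsx")
# data.frame() converts column names with spaces to dotted names, e.g. Borough.of.Occurrence
df <- data.frame(ccrb_data)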
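As an alternative to reordering the sheets by hand, the data tab could probably be read by name. The sketch below assumes the tab is called "Complaints_Allegations" as described in the assignment text; the helper name `read_ccrb` and the relative file path are hypothetical and only serve the illustration.

```r
library(readxl)

# Hypothetical helper: read the data sheet by name so the workbook
# does not have to be edited before loading.
read_ccrb <- function(path, sheet = "Complaints_Allegations") {
  # excel_sheets() lists the tab names found in the workbook
  stopifnot(sheet %in% excel_sheets(path))
  # data.frame() turns names such as "Borough of Occurrence"
  # into Borough.of.Occurrence, matching the plots in this report
  data.frame(read_excel(path, sheet = sheet))
}

# Assumed local path; adjust to wherever the file was downloaded.
path <- "ccrb_datatransparencyinitiative.xlsx"
if (file.exists(path)) {
  df <- read_ccrb(path)
  str(df)
}
```

Reading by name keeps the downloaded file untouched, which makes the analysis easier to rerun from a fresh download.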

Viz 1 Summary of the distribution of complaints across locations, using a pie chart.

pie(table(df$Borough.of.Occurrence))

Most complaints were reported in Brooklyn. The Bronx and Manhattan have almost the same number. Staten Island has the fewest complaints.
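Pie slices are hard to compare by eye, so a table of shares can back up the ranking. The sketch below uses a small made-up vector as a stand-in for `df$Borough.of.Occurrence`; the values are toy data, not the real counts.

```r
# Toy stand-in for df$Borough.of.Occurrence; values are made up.
borough <- c("Brooklyn", "Brooklyn", "Brooklyn", "Bronx", "Bronx",
             "Manhattan", "Manhattan", "Queens", "Staten Island")

# Count complaints per borough, largest first
counts <- sort(table(borough), decreasing = TRUE)

# Convert the counts to percentages of all complaints
shares <- round(100 * prop.table(counts), 1)
print(shares)
```

With the real data, replacing `borough` with `df$Borough.of.Occurrence` should give the percentages behind the pie chart.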

Viz 2 Summary of “full investigation or not”.

ggplot(df, aes(x = Borough.of.Occurrence, fill = Is.Full.Investigation)) + geom_bar(stat = 'count') + labs(title = "Full Investigation True or False", x = "Location", y = "Count") + theme_classic()

From the chart above we can see that almost half of the cases are fully investigated.

Viz 3 Summary of “Complaints have video evidence or not”.

ggplot(df, aes(x = Borough.of.Occurrence, fill = Complaint.Has.Video.Evidence)) + geom_bar(stat = 'count') + labs(title = "Complaints have video evidence or not", x = "Location", y = "Count") + theme_classic()

Most of the complaints don’t have video evidence.

Viz 4 Relation between “Complaints have video evidence or not” and “Full investigation or not”

ggplot(df, aes(x = Complaint.Has.Video.Evidence, fill = Is.Full.Investigation)) + geom_bar(stat = 'count') + labs(title = "Video evidence and Investigation", x = "Video Evidence", y = "Count") + theme_classic()

With video evidence, almost all cases are fully investigated. Without video evidence, only about half of the cases are fully investigated.
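Statements such as "almost all" and "only half" can be checked with a cross-tabulation normalized by row. The sketch below uses a tiny made-up data frame with the same column names; it assumes both columns are logical, which may not match how the real file stores them.

```r
# Toy data with the same column names as df; values are made up.
toy <- data.frame(
  Complaint.Has.Video.Evidence = c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE),
  Is.Full.Investigation        = c(TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE)
)

# Rows: video evidence; columns: full investigation
tab <- table(video = toy$Complaint.Has.Video.Evidence,
             full  = toy$Is.Full.Investigation)

# margin = 1 divides each row by its own total, giving the share of
# fully investigated cases within each video-evidence group
row_share <- prop.table(tab, margin = 1)
print(round(row_share, 2))
```

Row shares are the right comparison here because the two groups differ greatly in size, so raw counts alone can mislead.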

Viz 5 Summary of “Complaints Filed Mode”.

barplot(table(df$Complaint.Filed.Mode))

Most complaints are filed by phone; the second most common mode is the Call Processing System.

Viz 6 Summary of “Complaint Filed Place”.

ggplot(df, aes(x = Borough.of.Occurrence, fill = Complaint.Filed.Place)) + geom_bar(stat = 'count') + labs(title = "Complaint Filed Place", x = "Location", y = "Count") + theme_classic()

The pattern is similar across boroughs. Most complaints are filed at the CCRB; the second most common place is the IAB.

Viz 7 Summary of “Incident Location”.

ggplot(df, aes(x = Incident.Location, fill = Incident.Location)) + geom_bar(stat = 'count') + labs(title = "Incident Location", x = "Location", y = "Count") + theme_classic()

Most incidents happened on a street or highway; the second most common location is on a bus.

Viz 8 Relation between “Encounter Outcome” and “Full investigation or not”.

ggplot(df, aes(x = Is.Full.Investigation, fill = Encounter.Outcome)) + geom_bar(stat = 'count') + labs(title = "Outcome and Investigation", x = "Full investigation or not", y = "Count") + theme_classic()

With a full investigation, arrests and summonses together make up over 50% of cases. Without a full investigation, over half of the cases are No Arrest or Summons.
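Because this comparison is about shares rather than raw counts, a bar chart scaled to 100% may show it more directly. The sketch below uses `position = "fill"` on a small made-up data frame; the outcome labels are illustrative and may not match the exact values in the file.

```r
library(ggplot2)

# Toy data with the same column names as df; values are made up.
toy <- data.frame(
  Is.Full.Investigation = c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE),
  Encounter.Outcome = c("Arrest", "Arrest", "Summons", "No Arrest or Summons",
                        "No Arrest or Summons", "No Arrest or Summons", "Arrest")
)

# position = "fill" stacks each bar to a height of 1, so segment heights
# are within-group proportions instead of counts
p <- ggplot(toy, aes(x = Is.Full.Investigation, fill = Encounter.Outcome)) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = function(v) paste0(100 * v, "%")) +
  labs(title = "Outcome share by investigation type",
       x = "Full investigation or not", y = "Share of cases") +
  theme_classic()
print(p)
```

Swapping `toy` for `df` should produce the same view on the real data.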

Viz 9 Summary of cases received year.

ggplot(df, aes(x = Received.Year, fill = Received.Year)) + geom_bar(stat = 'count') + labs(title = "Cases Received Year", x = "Received Year", y = "Count") + theme_classic()

The number of complaints is decreasing each year.
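The downward trend can also be checked numerically by counting complaints per year and looking at the year-over-year change. The sketch below uses a made-up vector in place of `df$Received.Year`; the years and counts are toy values.

```r
# Toy stand-in for df$Received.Year; values are made up.
received_year <- c(2014, 2014, 2014, 2015, 2015, 2016)

# Complaints per year, ordered by year
per_year <- table(received_year)
print(per_year)

# Year-over-year change; negative values mean fewer complaints
change <- diff(as.vector(per_year))
print(change)
```

If every entry of `change` is negative on the real data, the counts fall in each consecutive year. Note that the most recent year may be incomplete, which would exaggerate the final drop.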

Viz 10 Summary of cases close year.

ggplot(df, aes(x = Close.Year, fill = Close.Year)) + geom_bar(stat = 'count') + labs(title = "Cases Close Year", x = "Close Year", y = "Count") + theme_classic()

Almost all cases are closed.

Summary

Exploratory data analysis is very useful for data sets we are not familiar with. We don’t need to change the original data set, and we can check the relationship between any variables as needed. The results are clear and easy to understand. There are many chart types to choose from, and choosing the right one is important. Exploratory data analysis helps us understand a data set directly. It is not just about graphics, but also about data collection and data cleaning. Single-variable analysis is not hard; however, I think finding the relationships between variables is more important, and this is what EDA does.