Objectives

The objective of this assignment is to conduct an exploratory data analysis of a data set that you are not familiar with. In this week’s lecture, we discussed a number of visualization approaches for exploring a data set. This assignment will apply those tools and techniques. An important distinction between class examples and applied data science work is the iterative and repetitive nature of exploring a data set. It takes time to understand what the data is and what is interesting about it (patterns).

For this week, we will be exploring data from the NYC Data Transparency Initiative. They maintain a database of complaints that fall within the remit of the Civilian Complaint Review Board (CCRB), an independent municipal agency. Your objective is to identify interesting patterns and trends within the data that may be indicative of large-scale trends.

This link will allow you to download the data set in .xlsx format. The data file has two tabs: one with metadata, and the “Complaints_Allegations” tab with the actual data.

Deliverable and Grades

For this assignment, you should submit a link to a knitr rendered html document that shows your exploratory data analysis. Organize your analysis using section headings:

# This is a top section

## This is a subsection

Your final document should include at minimum 10 visualizations. Each should include a brief statement of why you made the graphic.

A final section should summarize what you learned from your EDA. Your grade will be based on the quality of your graphics and the sophistication of your findings.

library(ggplot2)  # plotting
library(readxl)   # reading .xlsx files
# Sheet 2 is the data tab; sheet 1 holds the metadata
mydata = read_excel("C:/Users/yy/Desktop/ccrb_datatransparencyinitiative.xlsx", sheet = 2)

The first plot shows the number of complaints received in each year, so that we can see the trend and identify the year with the most complaints.

vis1= table(mydata$`Received Year`)
barplot(vis1, xlab="Complaint Received Year", col="skyblue")

The second plot shows the share of received complaints that falls in each borough.

vis2= table(mydata$`Borough of Occurrence`)
pie(vis2, radius = 1, col = rainbow(length(vis2)))
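
The pie chart above does not print the percentages themselves. A minimal sketch of how they could be computed and attached as slice labels is given here, using base R only. The vector `borough` and its values are made up for illustration and are not taken from the CCRB file.

```r
# Hypothetical toy data standing in for mydata$`Borough of Occurrence`
borough <- c("Bronx", "Brooklyn", "Brooklyn", "Manhattan", "Queens",
             "Brooklyn", "Manhattan", "Bronx", "Brooklyn", "Queens")

counts <- table(borough)                      # number of records per borough
pct    <- round(100 * prop.table(counts), 1)  # share of the total, in percent

# Label each slice as "name (xx%)"
slice_labels <- paste0(names(counts), " (", pct, "%)")
pie(counts, labels = slice_labels, col = rainbow(length(counts)))
```

With the real data, `borough` would be replaced by the borough column of `mydata`.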

The third plot shows the modes used to file complaints, so that we can see how people prefer to file.

vis3= table(mydata$`Complaint Filed Mode`)
barplot(vis3, xlab="Complaint Filed Mode", col="red")

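After looking at single variables, we move on to multivariate analysis by combining them. The next plot stacks the complaint filing modes within each borough. Because the data is first reduced to distinct year, borough, and mode combinations, each bar segment counts the years in which a mode was used in that borough rather than the number of complaints.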

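# unique() keeps one row per distinct year/borough/mode combination
vis4 = unique(mydata[c("Incident Year", "Borough of Occurrence", "Complaint Filed Mode")])
# data.frame() replaces the spaces in the column names with dots
vis4 = data.frame(vis4)
ggplot(vis4, aes(Borough.of.Occurrence, fill = Complaint.Filed.Mode)) + geom_bar()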
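
Stacked counts make it hard to compare boroughs of different sizes. One way to read off the share of each filing mode within a borough is a row-normalized cross-tabulation. The sketch below uses invented toy values: the data frame `toy` and its contents are assumptions for illustration, not CCRB data.

```r
# Hypothetical toy data: one row per record
toy <- data.frame(
  borough = c("Bronx", "Bronx", "Bronx", "Bronx", "Queens", "Queens"),
  mode    = c("Phone", "Phone", "Phone", "Mail", "Phone", "Mail")
)

tab    <- table(toy$borough, toy$mode)  # counts per borough and mode
shares <- prop.table(tab, margin = 1)   # normalize so each borough (row) sums to 1
round(shares, 2)
```

In ggplot2, `geom_bar(position = "fill")` draws the same within-group proportions as stacked bars of equal height.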

This plot shows the mix of encounter outcomes in each incident year, after reducing the data to distinct year, borough, and outcome combinations.

# One row per distinct year/borough/outcome combination
vis5 = unique(mydata[c("Incident Year", "Borough of Occurrence", "Encounter Outcome")])
vis5 = data.frame(vis5)
ggplot(vis5, aes(Incident.Year, fill = Encounter.Outcome)) + geom_bar()

Plot 6 uses box plots to show the distribution of incident years for each borough.

vis6=ggplot(mydata, aes(y=`Incident Year`,x=`Borough of Occurrence`)) + geom_boxplot()
vis6

Plot 7 shows how many incidents were recorded at each type of location.

vis7=ggplot(mydata, aes(x=`Incident Location`, fill=`Incident Location`)) + geom_bar() + labs(title="Incidents in different Locations", x="Incident Location") + scale_fill_discrete(name="Incident Location") + coord_flip()
vis7

Plot 8 examines the relationship between incident year and received year and adds a linear regression fit to the data.

vis8=ggplot(mydata, aes(x=`Incident Year`, y=`Received Year`)) +geom_point() + geom_smooth(method = lm) + labs(title="Relationship between Incident Year and Received Year", x="Incident Year", y="Received Year")
vis8
## `geom_smooth()` using formula 'y ~ x'
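
The scatter plot shows the fitted line but not the numbers behind it. As a rough sketch, the correlation coefficient and the regression coefficients could be computed directly with base R. The two vectors below are hypothetical toy values, not taken from the CCRB file.

```r
# Hypothetical toy values standing in for the two year columns
incident_year <- c(2010, 2011, 2012, 2013, 2014, 2015)
received_year <- c(2010, 2011, 2013, 2013, 2014, 2016)

r   <- cor(incident_year, received_year)   # Pearson correlation coefficient
fit <- lm(received_year ~ incident_year)   # linear model of the form y ~ x
coef(fit)                                  # intercept and slope of the fitted line

# Years between the incident and the filing of the complaint
lag <- received_year - incident_year
table(lag)
```

Because many records share the same pair of years, the points in the scatter plot overlap heavily; a table of the reporting lag like the one above can be easier to read.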

Plot 9 summarizes the distribution of incident years for each incident location.

# colour is set as a fixed value outside aes(); the stray color=cyl mapping (not a column in mydata) is removed
vis9=ggplot(mydata, aes(y=`Incident Year`,x=`Incident Location`)) + geom_boxplot(colour = "red")+coord_flip()
vis9

The final plot, plot 10, looks at the association between video evidence and whether a complaint received a full investigation.

vis10=ggplot(mydata, aes(x=`Is Full Investigation`, fill=`Complaint Has Video Evidence`)) + geom_bar() + labs(title="Correlation between Investigation and Video Evidence")
vis10

Data Summary

Before we start analyzing the data and testing any hypotheses, we need to find out how the variables in the data set relate to one another and pick out the important ones for further analysis. First we gather the data and check its quality. Then plotting and reviewing the data helps us understand the relationships between the variables, which enables us to identify valuable factors. Plotting also makes the properties of individual variables straightforward to see. Exploratory data analysis is often performed on a representative sample of the data, and it helps in understanding and summarizing the data without making biased assumptions.
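
The data quality check mentioned above can be as simple as inspecting column types and counting missing values and duplicate rows. A minimal sketch with base R follows. The data frame `toy` is a made-up stand-in, and its column names are assumptions rather than the actual CCRB column names.

```r
# Hypothetical toy data with a few missing values
toy <- data.frame(
  received_year = c(2015, 2016, NA, 2016),
  borough       = c("Bronx", NA, "Queens", "Bronx"),
  stringsAsFactors = FALSE
)

str(toy)                                  # column types and a preview of the values
missing_per_column <- colSums(is.na(toy)) # number of NA values in each column
duplicate_rows     <- sum(duplicated(toy))# rows that exactly repeat an earlier row
missing_per_column
duplicate_rows
```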