Objectives

The objective of this assignment is to conduct an exploratory data analysis of a data set that you are not familiar with. In this week's lecture we discussed a number of visualization approaches to exploring a data set; this assignment will apply those tools and techniques. An important distinction between class examples and applied data science work is the iterative and repetitive nature of exploring a data set. It takes time to understand what is in the data and what is interesting in the data.

For this week we will be exploring data from the NYC Data Transparency Initiative. They maintain a database of complaints that fall within the jurisdiction of the Civilian Complaint Review Board (CCRB), an independent municipal agency. Your objective is to identify interesting patterns and trends within the data that may be indicative of large-scale trends.

This link will allow you to download the data set in .xlsx format. The data file has two tabs: one with metadata, and the “Complaints_Allegations” tab with the actual data.

Deliverable and Grades

For this assignment you should submit a link to a knitr rendered html document that shows your exploratory data analysis. Organize your analysis using section headings:

# This is a top section
## This is a subsection

Your final document should include at minimum 10 visualizations. Each should include a brief statement of why you made the graphic.

A final section should summarize what you learned from your EDA. Your grade will be based on the quality of your graphics and the sophistication of your findings.

Data Import

library(ggplot2)
library(readxl)
data <- read_xlsx("/Users/sabrina/Desktop/HU Course/512 Visualization/ccrb_datatransparencyinitiative.xlsx", sheet=2)

Vis 1 - Incident Year

To see how many incidents happened in each year.

hist(data$"Incident Year", main="Histogram: Freq for each Incident Year", xlab="Incident Year", border="black", breaks = 15, col="blue")
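
The tab name "Complaints_Allegations" suggests one row per allegation, so a histogram over all rows may count allegations rather than incidents. The following is a minimal sketch of counting each complaint once, assuming `UniqueComplaintId` identifies a complaint (as the later sections also assume). The `toy` data frame is made up for illustration and stands in for `data`.

```r
# Toy stand-in for `data` (hypothetical values, not from the CCRB file):
# complaint 1 has two allegations, complaints 2 and 3 have one each.
toy <- data.frame(
  UniqueComplaintId = c(1, 1, 2, 3),
  `Incident Year`   = c(2010, 2010, 2010, 2011),
  check.names = FALSE
)

# Keep one row per complaint before counting.
complaints <- unique(toy[c("UniqueComplaintId", "Incident Year")])

# Counts per year: rows (allegations) versus distinct complaints.
rows_per_year       <- table(toy$`Incident Year`)
complaints_per_year <- table(complaints$`Incident Year`)
print(rows_per_year)
print(complaints_per_year)
```

With the real data, passing the de-duplicated year column to `hist()` would plot complaints rather than allegations.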

Vis 2 - Borough

To see the share of incidents in each New York borough.

Borough <- table(data$"Borough of Occurrence")
pie(Borough, radius = 1, col = c("brown1", "lightskyblue", "seagreen2","tan", "orchid", "lightpink","gray"))
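
Exact shares are hard to read off a pie chart, so it can help to print them as numbers too. This is a small sketch using base R's `table()` and `prop.table()`; the `borough` vector is hypothetical and stands in for `data$"Borough of Occurrence"`.

```r
# Hypothetical borough values (illustration only, not from the CCRB file).
borough <- c("Brooklyn", "Brooklyn", "Bronx", "Manhattan", "Queens")

# Counts per borough, converted to shares that sum to 1, largest first.
shares <- sort(prop.table(table(borough)), decreasing = TRUE)
print(round(shares, 2))
```

Applied to the `Borough` table built above, `prop.table(Borough)` would give the same kind of share for each slice of the pie.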

Vis 3 - Incident Year & Borough

To see whether each borough's share of incidents changes from year to year.

YearBorough<- unique(data[c("UniqueComplaintId","Incident Year","Borough of Occurrence")])
YearBorough<- data.frame(YearBorough)
ggplot(YearBorough,aes(Incident.Year,fill=Borough.of.Occurrence))+geom_bar()
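
A stacked bar of counts shows totals well, but changes in each borough's ratio are easier to judge when every year is scaled to sum to 1. This is a sketch of that calculation in base R, on a made-up `toy` data frame that mimics the column names of `YearBorough`.

```r
# Toy complaint-level rows (hypothetical values, one row per complaint).
toy <- data.frame(
  Incident.Year         = c(2010, 2010, 2010, 2010, 2011, 2011),
  Borough.of.Occurrence = c("Brooklyn", "Brooklyn", "Bronx", "Queens",
                            "Brooklyn", "Bronx")
)

# Cross-tabulate year by borough, then divide each row by its total
# (margin = 1) so that every year sums to 1.
counts <- table(toy$Incident.Year, toy$Borough.of.Occurrence)
share_by_year <- prop.table(counts, margin = 1)
print(round(share_by_year, 2))
```

In ggplot2, `geom_bar(position = "fill")` draws the same within-year shares as bars of equal height.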

Vis 4 - Encounter outcome of complaints for each incident year

To see whether each of the four encounter outcomes increases or decreases year by year.

EncounterOutcome<- unique(data[c("UniqueComplaintId","Incident Year","Encounter Outcome")])
EncounterOutcome<- data.frame(EncounterOutcome)
ggplot(EncounterOutcome,aes(Incident.Year,fill=Encounter.Outcome))+geom_bar()

Vis 5 - Encounter outcome of complaints by Borough

To see the relationship between the four encounter outcomes and the New York boroughs.

OutcomebyBorough<- unique(data[c("UniqueComplaintId","Borough of Occurrence","Encounter Outcome")])
OutcomebyBorough<- data.frame(OutcomebyBorough)
ggplot(OutcomebyBorough,aes(Encounter.Outcome,fill=Borough.of.Occurrence))+geom_bar()

Vis 6 - Allegation Type count

Which FADO allegation type is the most common?

AllegationType<- unique(data[c("UniqueComplaintId","Allegation FADO Type")])
AllegationType<- data.frame(AllegationType)
ggplot(AllegationType,aes(Allegation.FADO.Type,fill=Allegation.FADO.Type))+geom_bar()

Vis 7 - Frequency of Incident Occurrence by Borough and Type

To see the relationship between FADO type and borough.

BoroughType<- unique(data[c("UniqueComplaintId","Allegation FADO Type","Borough of Occurrence")])
BoroughType<- data.frame(BoroughType)
ggplot(BoroughType,aes(Borough.of.Occurrence,fill=Allegation.FADO.Type))+geom_bar()

Vis 8 - Incident Location & Borough

What is the most likely location for an incident? (Street/Highway, then Apartment/House.)

Location<- unique(data[c("UniqueComplaintId","Incident Location")])
Location<- data.frame(Location)
ggplot(Location,aes(Incident.Location,fill=Incident.Location))+geom_bar()

We have seen that incidents most often happen on a Street/Highway or in an Apartment/House. Does the same pattern hold in each New York borough? (Roughly yes.)

# geom_bar() counts rows per category; geom_histogram(stat = "count") triggered an "unknown parameters" warning
ggplot(data, aes(x = `Borough of Occurrence`, fill = `Incident Location`)) +
  geom_bar() +
  labs(title = "Incident Location by Borough", x = "Borough of Occurrence", y = "Count") +
  scale_fill_discrete(name = "Incident Location") + theme(legend.position = "bottom")

Vis 9 - Complaint Filed Mode

How did people file their complaints? (Phone is the most common mode.)

FiledMode <- table(data$"Complaint Filed Mode")
pie(FiledMode, radius = 1, col = c("brown1", "yellow", "seagreen2","tan", "orchid", "lightpink","lightskyblue"))

Vis 10 - Relationship between Incident Year & Received Year

How long is the gap between the year an incident happened and the year the complaint was received?

ggplot(data, aes(x = `Incident Year`, y = `Received Year`)) + geom_point(shape = 17) +
  geom_smooth(method = "lm", se = FALSE, color = "orange") + labs(title = "Relationship between Incident Year & Received Year", x = "Incident Year", y = "Received Year")
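
The gap itself can also be computed directly as the difference between the two year columns, which answers the question more literally than a scatter plot where many points overlap. This is a minimal sketch on a made-up `toy` data frame; the column names match the real data but the values are hypothetical.

```r
# Hypothetical incident/received years (illustration only).
toy <- data.frame(
  `Incident Year` = c(2010, 2010, 2011, 2012),
  `Received Year` = c(2010, 2011, 2011, 2012),
  check.names = FALSE
)

# Gap in whole years between the incident and the complaint being received.
toy$gap_years <- toy$`Received Year` - toy$`Incident Year`

# Frequency of each gap: 0 = same calendar year, 1 = following year, and so on.
gap_counts <- table(toy$gap_years)
print(gap_counts)
```

Because only years are compared, a gap of 1 can mean anything from a few days across New Year to almost two years.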

Summary

Exploratory data analysis (EDA) is an approach to analyzing data that summarizes its main characteristics, often visually, before committing to formal assumptions or models. It was promoted by John Tukey in the 1970s. With this technique I can summarize the data and understand it faster, then work out which questions I want to ask, how to frame them, and how best to manipulate the available data sources to get the answers I am looking for. EDA is also important for eliminating or sharpening potential hypotheses about the big picture that the data can address.

This dataset, from the Civilian Complaint Review Board (CCRB), covers complaints and incidents. EDA helped me identify interesting patterns and trends in the data. For example, the frequency of incidents rose significantly after 2005, but after 2015 it dropped below the pre-2005 level. Brooklyn has the largest share of incidents, while the Bronx and Manhattan are about the same. Street/Highway is the most common incident location.

This assignment made me realize again how powerful data visualization can be. With ggplot, the information and insights appear as graphs in RStudio within a few seconds. Nice and clean! It definitely took less time to understand the data this way than by looking at the raw data.

ANLY 512 - Problem Set 4

Exploratory Data Analysis

NingYin Yang

12/12/2017