The objective of this assignment is to conduct an exploratory data analysis of a data set that you are not familiar with. In this week's lecture we discussed a number of visualization approaches to exploring a data set; this assignment will apply those tools and techniques. An important distinction between class examples and applied data science work is the iterative and repetitive nature of exploring a data set. It takes time to understand what is in the data and what is interesting in the data.
For this week we will be exploring data from the NYC Data Transparency Initiative. They maintain a database of complaints that fall within the remit of the Civilian Complaint Review Board (CCRB), an independent municipal agency. Your objective is to identify interesting patterns and trends within the data that may be indicative of large-scale trends.
This link will allow you to download the data set in .xlsx format. The data file has two tabs: one with metadata, and the “Complaints_Allegations” tab with the actual data.
For this assignment you should submit a link to a knitr-rendered HTML document that shows your exploratory data analysis. Organize your analysis using section headings:
# This is a top section
## This is a subsection
Your final document should include at minimum 10 visualizations. Each should include a brief statement of why you made the graphic.
A final section should summarize what you learned from your EDA. Your grade will be based on the quality of your graphics and the sophistication of your findings.
library(ggplot2)
library(readxl)
data <- read_xlsx("/Users/sabrina/Desktop/HU Course/512 Visualization/ccrb_datatransparencyinitiative.xlsx", sheet=2)
To see how many incidents happened in each year
hist(data$"Incident Year", main="Histogram: Freq for each Incident Year", xlab="Incident Year", border="black", breaks = 15, col="blue")
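The histogram above runs on every row of the sheet. The later charts call `unique()` on `UniqueComplaintId`, which suggests a complaint can span several rows, so row counts may overstate the number of complaints. The following is a minimal sketch of counting each complaint once; `toy_rows` and its values are made up for illustration and are not CCRB data.

```r
# Hypothetical rows: complaint 1 appears twice, as if it had two allegations.
toy_rows <- data.frame(
  UniqueComplaintId = c(1, 1, 2, 3),
  Incident.Year = c(2010, 2010, 2010, 2011)
)

# Keep one row per complaint before counting, as the later charts do with unique().
per_complaint <- unique(toy_rows[c("UniqueComplaintId", "Incident.Year")])

# Number of distinct complaints per incident year.
year_counts <- table(per_complaint$Incident.Year)
print(year_counts)
```

On the real data the same two steps would use the `UniqueComplaintId` and `Incident Year` columns of `data`.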
To see the share of incidents in each New York borough
Borough <- table(data$"Borough of Occurrence")
pie(Borough, radius = 1, col = c("brown1", "lightskyblue", "seagreen2","tan", "orchid", "lightpink","gray"))
#Vis 3 - Incident Year & Borough
To see whether the share of incidents in each borough changes from year to year
YearBorough<- unique(data[c("UniqueComplaintId","Incident Year","Borough of Occurrence")])
YearBorough<- data.frame(YearBorough)
ggplot(YearBorough,aes(Incident.Year,fill=Borough.of.Occurrence))+geom_bar()
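Because the question here is about shares rather than raw counts, a proportional version of the same stacked bar chart may be easier to read. This is a minimal sketch, not part of the original analysis: `toy_borough` is a small made-up stand-in for `YearBorough`, and `position = "fill"` rescales each bar to a total of 1.

```r
library(ggplot2)

# Hypothetical stand-in for YearBorough; the values are invented for illustration.
toy_borough <- data.frame(
  Incident.Year = c(2005, 2005, 2005, 2006, 2006, 2006, 2006),
  Borough.of.Occurrence = c("Bronx", "Brooklyn", "Brooklyn",
                            "Bronx", "Bronx", "Brooklyn", "Queens")
)

# position = "fill" rescales every bar to 1, so segments show shares, not counts.
p_share <- ggplot(toy_borough,
                  aes(factor(Incident.Year), fill = Borough.of.Occurrence)) +
  geom_bar(position = "fill") +
  labs(x = "Incident Year", y = "Share of complaints", fill = "Borough")

# The same shares as a table: each row (one year) sums to 1.
shares <- prop.table(
  table(toy_borough$Incident.Year, toy_borough$Borough.of.Occurrence),
  margin = 1
)
print(round(shares, 2))
```

Printing `p_share` draws the chart. Swapping `toy_borough` for `YearBorough` should give the per-year borough shares for the real data.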
To see whether each of the 4 kinds of encounter outcome increases or decreases year by year
EncounterOutcome<- unique(data[c("UniqueComplaintId","Incident Year","Encounter Outcome")])
EncounterOutcome<- data.frame(EncounterOutcome)
ggplot(EncounterOutcome,aes(Incident.Year,fill=Encounter.Outcome))+geom_bar()
To see the relationship between these 4 kinds of encounter outcome and the New York boroughs
OutcomebyBorough<- unique(data[c("UniqueComplaintId","Borough of Occurrence","Encounter Outcome")])
OutcomebyBorough<- data.frame(OutcomebyBorough)
ggplot(OutcomebyBorough,aes(Encounter.Outcome,fill=Borough.of.Occurrence))+geom_bar()
What is the most common FADO allegation type?
AllegationType<- unique(data[c("UniqueComplaintId","Allegation FADO Type")])
AllegationType<- data.frame(AllegationType)
ggplot(AllegationType,aes(Allegation.FADO.Type,fill=Allegation.FADO.Type))+geom_bar()
To see the relationship between FADO allegation type and New York borough
BoroughType<- unique(data[c("UniqueComplaintId","Allegation FADO Type","Borough of Occurrence")])
BoroughType<- data.frame(BoroughType)
ggplot(BoroughType,aes(Borough.of.Occurrence,fill=Allegation.FADO.Type))+geom_bar()
Location<- unique(data[c("UniqueComplaintId","Incident Location")])
Location<- data.frame(Location)
ggplot(Location,aes(Incident.Location,fill=Incident.Location))+geom_bar()
# geom_bar() counts rows per category; geom_histogram(stat = "count") drew the same bars but raised a warning
ggplot(data, aes(x = `Borough of Occurrence`, fill = `Incident Location`)) +
  geom_bar() +
  labs(title = "Incident Location by Borough", x = "Borough of Occurrence", y = "Count") +
  scale_fill_discrete(name = "Incident Location") + theme(legend.position = "bottom")
How did people file the complaints? (Phone is the easiest way to go.)
FiledMode <- table(data$"Complaint Filed Mode")
pie(FiledMode, radius = 1, col = c("brown1", "yellow", "seagreen2","tan", "orchid", "lightpink","lightskyblue"))
How long was the gap between the year an incident happened and the year the complaint was received?
# Refer to columns by name inside aes() rather than through data$
ggplot(data, aes(x = `Incident Year`, y = `Received Year`)) +
  geom_point(shape = 17) +
  geom_smooth(method = "lm", se = FALSE, color = "orange") +
  labs(title = "Relationship between Incident Year & Received Year", x = "Incident Year", y = "Received Year")
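The scatter plot shows how closely the two years track each other, but the gap itself can also be computed directly. This is a minimal sketch with made-up values: `toy_years` and its column names are hypothetical, and on the real data the same subtraction would use the `Received Year` and `Incident Year` columns.

```r
# Hypothetical values standing in for the two year columns; not real CCRB data.
toy_years <- data.frame(
  incident_year = c(2010, 2010, 2011, 2012, 2012),
  received_year = c(2010, 2011, 2011, 2012, 2014)
)

# Gap in whole years between the incident and the complaint being received.
toy_years$gap <- toy_years$received_year - toy_years$incident_year

# Count how many rows fall into each gap size.
gap_counts <- table(toy_years$gap)
print(gap_counts)
```

A bar chart of `gap_counts` would answer the question more directly than the scatter plot, since most points on the scatter plot overlap.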
Exploratory data analysis (EDA) is a statistical approach to analyzing data without making prior assumptions about its contents. It was developed by John Tukey in the 1970s. Through this technique, I can summarize the data and understand it better and more quickly, then figure out which questions I want to ask, how to frame them, and how best to manipulate the available data sources to get the answers I am looking for. EDA is also important for eliminating or sharpening potential hypotheses about the big picture that the data can address.
This dataset, from the Civilian Complaint Review Board (CCRB), covers complaints and incidents. EDA helps me identify interesting patterns and trends within the data. For example, the frequency of incidents rose significantly after 2005, but after 2015 it dropped below the level seen before 2005. Brooklyn has the largest share of incidents, while the Bronx and Manhattan are about the same. Street/Highway is the most common location for incidents.
This assignment made me realize again how powerful data visualization can be. With ggplot, the graphs appear in RStudio within a few seconds and bring out the information and insights in the data. Nice and clean! It definitely took much less time to understand the data this way than by looking through the raw data.