For this assignment you should submit a link to a knitr-rendered HTML document that shows your exploratory data analysis. Organize your analysis using section headings:
Your final document should include at least 10 visualizations. Each should include a brief statement of why you made the graphic.
A final section should summarize what you learned from your EDA. Your grade will be based on the quality of your graphics and the sophistication of your findings.
library(readxl)
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggthemes)
library(data.table)
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
hw4 <- read_excel("C:/Users/Olivia/Desktop/HU/Data Visualization/ccrb_datatransparencyinitiative.xlsx",
sheet = "Complaints_Allegations")
deduped.data <- unique( hw4[ , 1:5 ] )
deduped.data %>%
ggplot(aes(x = `Received Year`, fill = `Borough of Occurrence`)) +
geom_bar(position = "stack") + theme_economist() +
labs(title = "Complaints counts by Borough of Occurrence across years")
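uniincident <- unique(hw4[, 1:15])
# Drop rows whose borough is missing or is the literal string "NA"
validlocation <- filter(uniincident, !is.na(`Borough of Occurrence`), `Borough of Occurrence` != "NA")
# Keep incidents from 2006 through 2016 (filter() also drops rows with a missing year)
after2005 <- filter(validlocation, `Incident Year` >= 2006, `Incident Year` <= 2016)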
ggplot(after2005, aes(x = `Borough of Occurrence`, fill = `Incident Location`)) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent) +
  scale_fill_discrete(name = "Incident Location") +
  labs(title = "Percentage of Incident Location by Borough", x = "Borough of Occurrence", y = "Percentage of Incidents") +
  # Apply the complete theme first so it does not reset legend.position
  theme_economist() +
  theme(legend.position = "bottom")
ggplot(after2005, aes(x = `Borough of Occurrence`, fill = `Incident Location`)) +
  geom_bar(position = "stack") +
  theme_economist() +
  labs(title = "Incident Location by Borough")
ggplot(after2005, aes(x = `Incident Year`, fill = `Complaint Filed Mode`)) +
  geom_bar(position = "stack") +
  theme_economist() +
  labs(title = "Complaint Filed Mode across years")
ggplot(hw4, aes(x = `Allegation Description`)) +
  geom_bar(width = 0.5) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(title = "Most Frequent Allegation Types")
# One density curve of received year per video-evidence group;
# as.numeric() is a defensive conversion in case the year was read as text
ggplot(after2005, aes(x = as.numeric(`Received Year`), colour = factor(`Complaint Has Video Evidence`))) +
  geom_density() +
  theme_economist() +
  labs(title = "Density graph of complaints with or without video evidence", x = "Received Year", y = "Density", colour = "Complaint Has Video Evidence")
ggplot(after2005, aes(x = `Encounter Outcome`, fill = `Allegation FADO Type`)) +
  geom_bar(position = "stack") +
  theme_economist() +
  labs(title = "Allegation FADO Type by Encounter Outcome")
# Complaints that received a full investigation
fullinvestigate <- after2005[after2005$`Is Full Investigation` == "TRUE", ]
pie <- ggplot(data = fullinvestigate) +
  geom_bar(mapping = aes(x = factor(1), fill = `Allegation FADO Type`), width = 1) +
  coord_polar(theta = "y") +
  labs(title = "Full Investigation by Allegation FADO Type")
pie
# Complaints that did not receive a full investigation
notfullinvestigate <- after2005[after2005$`Is Full Investigation` == "FALSE", ]
pie <- ggplot(data = notfullinvestigate) +
  geom_bar(mapping = aes(x = factor(1), fill = `Allegation FADO Type`), width = 1) +
  coord_polar(theta = "y") +
  labs(title = "Not Full Investigation by Allegation FADO Type")
pie
The goal of this assignment is to practice exploratory data analysis. When I encounter a dataset that I am not familiar with, my first step is to understand its main characteristics through data visualization. What I learned from this project is that a good practice is to start with basic charts, such as scatter plots, histograms, pie charts, and density plots, to find interesting patterns in the data.
I tried plotting different graphics with the ggplot2 package. I realized that some graphics look cool but make it hard to discover patterns in the data. I think this also has to do with the characteristics of our dataset. The dataset we used does not have numerical variables, so we need to find graphic types that are useful for analyzing categorical data.
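For a dataset made up of categorical columns, a cross-tabulation is one simple way to preview what a stacked or filled bar chart will show before drawing it. The sketch below uses base R only and a small made-up data frame; the column names `borough` and `location` and their values are hypothetical stand-ins, not taken from the CCRB file.

```r
# Hypothetical toy data standing in for two categorical columns
toy <- data.frame(
  borough  = c("Bronx", "Bronx", "Brooklyn", "Brooklyn", "Brooklyn", "Queens"),
  location = c("Street", "Residence", "Street", "Street", "Residence", "Street")
)

# Count each borough/location combination
counts <- table(toy$borough, toy$location)

# Convert counts to row-wise shares, which is what position = "fill" plots
shares <- prop.table(counts, margin = 1)

print(counts)
print(round(shares, 2))
```

Each row of `shares` sums to 1, so it corresponds to one full-height bar in a filled bar chart.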
Then, it is important to clean the data. The data do not appear to have been collected completely before 2006, and including those incomplete records may affect the results. While working on the exploratory data analysis, I also noticed that several rows share the same Unique Complaint Id. We need to be careful when pre-processing the data; otherwise, we may count the same case multiple times and distort the final analysis.
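The two cleaning steps described here, keeping one row per complaint and restricting to the years with complete records, can be sketched in base R as follows. This is only an illustration under assumptions: the data frame `toy` and its columns `complaint_id`, `incident_year`, and `allegation` are hypothetical and much simpler than the real file.

```r
# Hypothetical toy data: one row per allegation, so a complaint can repeat
toy <- data.frame(
  complaint_id  = c(1, 1, 2, 3, 3, 3, 4),
  incident_year = c(2004, 2004, 2007, 2010, 2010, 2010, 2016),
  allegation    = c("Force", "Discourtesy", "Force", "Abuse", "Force", "Discourtesy", "Force")
)

# Keep the first row seen for each complaint ID
one_per_complaint <- toy[!duplicated(toy$complaint_id), ]

# Restrict to the years where the records look complete (2006 to 2016)
complete_years <- one_per_complaint[
  one_per_complaint$incident_year >= 2006 & one_per_complaint$incident_year <= 2016, ]

nrow(toy)                # 7 allegation rows
nrow(one_per_complaint)  # 4 distinct complaints
nrow(complete_years)     # 3 complaints from 2006 to 2016
```

Deduplicating on the ID alone keeps whichever allegation happens to come first, so this form suits complaint-level counts; allegation-level charts would need the full table.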
Everything above is a first step toward becoming familiar with our data. From here, we can start making the charts fancier with different combinations and customizations for further analysis.