We begin this exercise by downloading the data from its source and loading it with a library that reads Excel files, then answer a series of exploratory data analysis (EDA) questions using different visualizations.
library(ggplot2)
library(readxl)
library(tidyverse)
library(dplyr)
setwd("C:/Users/shoukhan/Documents/Harrisburg University/Anly-512-Data-Visualization/Summer Course")
dataset <- read_xlsx("ccrb_datatransparencyinitiative.xlsx", sheet = "Complaints_Allegations")
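# Drop rows whose borough is missing, whether stored as a real NA or as the string "NA"
d1 <- dataset[!is.na(dataset$`Borough of Occurrence`) & dataset$`Borough of Occurrence` != "NA", ]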
crime.freq <- table(d1$`Borough of Occurrence`)
label <- names(crime.freq)
percentage <- round(crime.freq / sum(crime.freq) * 100)
# Build "Borough - NN%" labels in one step
labelWithPercent <- paste0(label, " - ", percentage, "%")
pie(crime.freq, labelWithPercent, main = "Crime distribution by jurisdiction")
Based on the pie chart, it is easy to see that the top three boroughs are Brooklyn, the Bronx and Manhattan. It also shows that few complaints originate outside New York City.
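The percentage labels used for the pie chart can be illustrated on a small made-up table. This is only a sketch: the counts and the `toy.*` names below are hypothetical and are not taken from the CCRB data.

```r
# Hypothetical complaint counts per borough (illustrative only, not CCRB data)
toy.freq <- c(Brooklyn = 50, Bronx = 30, Manhattan = 20)

# Share of each category, rounded to a whole percent
toy.pct <- round(toy.freq / sum(toy.freq) * 100)

# Build "Name - NN%" labels in a single paste0() call
toy.labels <- paste0(names(toy.freq), " - ", toy.pct, "%")
print(toy.labels)

# Sorting the counts makes the ranking explicit instead of reading it off the chart
toy.top <- names(sort(toy.freq, decreasing = TRUE))
print(toy.top)
```

Sorting the frequency table is a quick way to confirm a "top three" reading of a pie chart, since slice sizes can be hard to compare by eye.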
dataByYear <- subset(dataset, `Received Year` >= 2006)
ggplot(dataByYear, aes(x = `Received Year`, fill = `Allegation FADO Type`)) +
  geom_bar() +
  labs(title = "Distribution of Complaints by Type and Year", x = "Year", y = "Complaints") +
  theme(legend.position = "right") +
  scale_fill_discrete(name = "Allegation Type") +
  coord_flip()
This visualization suggests that the number of complaints increases year over year, while the most common allegation types remain the same.
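# FUN = length counts the rows with a non-missing video-evidence value per year,
# whatever that value is
d <- aggregate(`Complaint Has Video Evidence` ~ `Received Year`, FUN = length, data = dataset)
# Keep only the years with at least 10,000 such rows
d <- d[d$`Complaint Has Video Evidence` >= 10000, ]
ggplot(d, aes(`Received Year`, `Complaint Has Video Evidence`)) + geom_line()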
The chart above shows the number of records with video-evidence information rising year over year and then, for reasons the data does not explain, declining.
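# Refer to columns by name inside aes(); the axis labels follow the aesthetics
# (x = borough, y = received year) before coord_flip() swaps the display
ggplot(d1, aes(x = `Borough of Occurrence`, y = `Received Year`)) +
  geom_boxplot() +
  coord_flip() +
  labs(title = "Comparison of Received Year vs Borough", x = "Borough of Occurrence", y = "Received Year")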
The box plot shows the median and spread of the received years for each borough and makes outliers easy to spot.
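What a box plot draws can be checked numerically with base R's `boxplot.stats()`. The sketch below uses made-up years (the `toy.*` names and values are hypothetical) to show that the centre line is the median rather than the mean, and how outliers are identified.

```r
# Hypothetical received years for one borough (illustrative only)
toy.years <- c(2006, 2007, 2007, 2008, 2008, 2008, 2009, 2009, 2010, 1990)

# boxplot.stats() returns the five-number summary ($stats) and the
# points lying beyond the whiskers ($out)
toy.box <- boxplot.stats(toy.years)

toy.median   <- toy.box$stats[3]  # the line drawn inside the box
toy.outliers <- toy.box$out       # points drawn individually
toy.mean     <- mean(toy.years)   # not shown by a default box plot

print(toy.median)
print(toy.outliers)
print(toy.mean)
```

In this toy sample the single early year pulls the mean below the median, which is why the two should not be read interchangeably from a box plot.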
# Density of incident years, one curve per borough
ggplot(data = d1, aes(x = `Incident Year`, colour = `Borough of Occurrence`)) +
  geom_density(alpha = 1) +
  theme_classic() +
  labs(title = "Incident Year by Borough of Occurrence - Density", x = "Incident Year", y = "Density")
Density plots are useful for capturing the distribution of the data; here they help to interpret how incidents are spread over the years in each borough.
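# Drop rows whose allegation type is missing, whether stored as a real NA or as the string "NA"
d2 <- dataset[!is.na(dataset$`Allegation FADO Type`) & dataset$`Allegation FADO Type` != "NA", ]
# Density of incident years, one curve per allegation type
ggplot(data = d2, aes(x = `Incident Year`, colour = `Allegation FADO Type`)) +
  geom_density(alpha = 1) +
  theme_classic() +
  labs(title = "Incident Year by Type of Allegation - Density", x = "Incident Year", y = "Density")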
This visualization helps to interpret the distribution of complaints by allegation type.
d3 <- unique(dataset[c("UniqueComplaintId", "Received Year", "Incident Year", "Close Year")])
# Investigation length in whole years
d3$Length <- d3$`Close Year` - d3$`Received Year`
# The formula interface drops complaints with a missing length before averaging
d4 <- aggregate(Length ~ `Received Year`, data = d3, FUN = mean)
ggplot(data = d4, aes(x = `Received Year`, y = Length)) +
  geom_line() +
  labs(title = "Average Timeline of Investigation")
The line chart shows the average investigation time, in years, for registered complaints by the year they were received.
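The averaging step can be sketched on a tiny made-up table. The data frame and its column names (`received`, `closed`, `duration`) are hypothetical and chosen only for illustration; one complaint is left open to show how a missing close year is handled.

```r
# Hypothetical complaints (illustrative only); the last one is still open,
# so its close year is NA
toy <- data.frame(
  received = c(2010, 2010, 2011, 2011, 2011),
  closed   = c(2010, 2012, 2011, 2012, NA)
)

# Duration in whole years between receipt and closure
toy$duration <- toy$closed - toy$received

# The formula interface of aggregate() omits rows with NA by default,
# so the open complaint does not turn the 2011 average into NA
toy.avg <- aggregate(duration ~ received, data = toy, FUN = mean)
print(toy.avg)
```

Because only years are available, each duration is a whole number of years, so the averages are coarse: a complaint received in December and closed in January counts as one year.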
res.freq <- table(dataset$`Encounter Outcome`)
label <- names(res.freq)
percentage <- round(res.freq / sum(res.freq) * 100)
# Build "Outcome - NN%" labels in one step
labelWithPercent <- paste0(label, " - ", percentage, "%")
pie(res.freq, labelWithPercent, col = rainbow(length(label)), main = "Outcome of Incidents")
This pie chart represents the outcome of the encounters behind the registered complaints.
d4 <- dataset %>%
  group_by(`Complaint Filed Mode`) %>%
  count(`Received Year`)
ggplot(d4, aes(x = `Received Year`, y = n, fill = `Complaint Filed Mode`)) +
  geom_col(position = "stack") +
  labs(title = "Communication Preference over Years", x = "Year", y = "Count")
This visualization represents the most common ways of filing a complaint over the years. Interestingly, despite the growth of online channels, most complaints are still registered by phone.
d5 <- dataset %>%
  group_by(`Incident Location`, `Encounter Outcome`) %>%
  count(`Incident Year`)
ggplot(d5, aes(x = `Incident Location`, y = n, fill = `Encounter Outcome`)) +
  geom_col(position = "stack") +
  labs(title = "Frequent Location of Incidents and Outcome", x = "Incident Location", y = "Count") +
  coord_flip()
This stacked bar plot represents the most common incident locations and the outcome observed in most cases at each location.
Designing ten visualizations for this dataset was a good exercise and is very helpful when the main goal is EDA. EDA helps to frame research questions and to gather information about the dataset at hand. This analysis answered ten EDA questions, from the boroughs with the most registered complaints to the most frequent incident locations and their outcomes. The most surprising finding is that, in the current decade, the phone remains the most common way to file a complaint even though people are familiar with apps and the web; two possible reasons are response time and a poor user interface in the other channels. The results indicate that the most common allegation type is Abuse of Authority, the borough with the most registered complaints is Brooklyn, the average investigation takes less than a year, and the most common incident location is Street/Highway.