Objectives
The objective of this assignment is to conduct an exploratory data analysis of a data set that you are not familiar with. In this week's lecture we discussed a number of visualization approaches to exploring a data set; this assignment will apply those tools and techniques. An important distinction between class examples and applied data science work is the iterative and repetitive nature of exploring a data set. It takes time to understand what is in the data and what is interesting in it.
For this week we will be exploring data from the NYC Data Transparency Initiative. They maintain a database of complaints that fall within the jurisdiction of the Civilian Complaint Review Board (CCRB), an independent municipal agency. Your objective is to identify interesting patterns and trends within the data that may be indicative of large-scale trends.
This link will allow you to download the data set in .xlsx format. The data file has two tabs: one with metadata, and the “Complaints_Allegations” tab with the actual data.
library(ggplot2)
library(readxl)
ccrb <- read_excel("~/Downloads/ccrb.xlsx",sheet = "Complaints_Allegations")
barplot(table(ccrb$Borough_of_Occurrence), xlab = "Borough of Occurrence", ylab = "Occurrences", col = "grey")
pie(table(ccrb$Is_Full_Investigation))
Most occurrences have not been fully investigated.
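The share behind this statement can be checked numerically rather than read off the pie chart. The sketch below is a minimal illustration, assuming `Is_Full_Investigation` is a logical column; the `toy` data frame is hypothetical and its values are invented, so on the real data `ccrb` would take its place.

```r
# Hypothetical stand-in for ccrb; the values are invented for illustration only.
toy <- data.frame(
  Is_Full_Investigation = c(TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE)
)

# Count each level, then convert the counts to proportions.
counts <- table(toy$Is_Full_Investigation)
shares <- prop.table(counts)

print(round(shares, 3))
```

A proportion table like this states exactly how large "most" is, which a pie chart only shows approximately.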
# Cases closed per year, split by whether a full investigation took place
ggplot(ccrb, aes(x = CloseYear, fill = Is_Full_Investigation)) +
  geom_bar() +
  labs(title = "Number of Cases Closed", x = "Year", y = "Cases closed") +
  theme(legend.position = "bottom") +
  scale_fill_discrete(name = "Is investigated") +
  scale_x_continuous(breaks = seq(1999, 2016, 1))
# Incidents per year, split by whether the complaint has video evidence
ggplot(ccrb, aes(x = inc_year, fill = cmpt_vd_evid)) +
  geom_bar() +
  labs(title = "Incidents by Year and Video Evidence", x = "Year", y = "Number of Incidents") +
  theme(legend.position = "bottom") +
  scale_fill_discrete(name = "Has Video Evidence") +
  scale_x_continuous(breaks = seq(1999, 2016, 1))
Only a small portion of incidents each year had video evidence. In fact, prior to 2010, there was no video evidence at all.
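Both observations can be backed by numbers instead of being read off the stacked bars. The following is a minimal sketch, assuming `cmpt_vd_evid` is a logical column and `inc_year` is numeric; the `toy` data frame and its values are hypothetical and stand in for `ccrb`.

```r
# Hypothetical stand-in for ccrb; the values are invented for illustration only.
toy <- data.frame(
  inc_year     = c(2008, 2008, 2009, 2010, 2010, 2011, 2011, 2011),
  cmpt_vd_evid = c(FALSE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE)
)

# Share of incidents with video evidence in each incident year
# (the mean of a logical vector is the proportion of TRUE values).
video_share <- tapply(toy$cmpt_vd_evid, toy$inc_year, mean)

# Earliest incident year that has at least one incident with video evidence.
first_video_year <- min(toy$inc_year[toy$cmpt_vd_evid])

print(round(video_share, 2))
print(first_video_year)
```

If the column is stored as text rather than as a logical in the real file, it would need converting first.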
5. Frequency of Incidents by Borough and Type
# Incident counts per borough, stacked by allegation type
ggplot(ccrb, aes(x = Borough_of_Occurrence, fill = alg_type)) +
  geom_bar() +
  labs(title = "Frequency of Incidents by Borough and Type", x = "Borough", y = "Frequency of Occurrence") +
  scale_fill_discrete(name = "Allegation Type") +
  theme(legend.position = "bottom")
Abuse of Authority has been the most common allegation type over the years.
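The counts behind the stacked bars can also be tabulated directly, which makes the ranking of allegation types explicit. This is a sketch only: the `toy` data frame is hypothetical and its rows are invented, reusing the column names `Borough_of_Occurrence` and `alg_type` from the plot.

```r
# Hypothetical stand-in for ccrb; the rows are invented for illustration only.
toy <- data.frame(
  Borough_of_Occurrence = c("Bronx", "Bronx", "Brooklyn", "Brooklyn", "Brooklyn", "Queens"),
  alg_type = c("Abuse of Authority", "Force", "Abuse of Authority",
               "Abuse of Authority", "Discourtesy", "Force")
)

# Cross-tabulate borough against allegation type
# (the numbers that a stacked bar chart draws).
by_borough <- table(toy$Borough_of_Occurrence, toy$alg_type)

# Overall counts per allegation type, largest first.
type_counts <- sort(table(toy$alg_type), decreasing = TRUE)

print(by_borough)
print(names(type_counts)[1])
```

Adding the year column as a third dimension to `table()` would show whether the ranking holds in every year.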
6. Histogram of Complaints Received by Year
hist(ccrb$ReceivedYear,
main="Histogram for Complaints Received by Year", xlab="Received Year", col="grey")
# Incident year against the year the complaint was received, with a linear fit
ggplot(ccrb, aes(x = inc_year, y = ReceivedYear)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(title = "Incident Year vs. Received Year", x = "Incident Year", y = "Received Year") +
  theme(panel.grid.minor = element_line(colour = "white"))
# A fixed colour is set outside aes(); inside aes() the string 'red' would be treated as data
ggplot(ccrb, aes(x = enc_outcome, y = inc_year)) +
  geom_boxplot(colour = "red") +
  labs(x = "Encounter Outcome", y = "Incident Year")
# Labels follow the aesthetics, not the flipped axes: x is the location, y is the year
ggplot(ccrb, aes(x = inc_loc, y = inc_year)) +
  geom_boxplot() +
  coord_flip() +
  labs(title = "Incident Year by Location", x = "Incident Location", y = "Year of Incident")
# Density of the received year, one curve per allegation type
ggplot(ccrb, aes(x = ReceivedYear, colour = alg_type)) +
  geom_density(alpha = 1) +
  theme_classic() +
  labs(title = "Density of Received Year by Allegation Type", x = "Received Year", y = "Density", colour = "Allegation Type")
Exploratory data analysis (EDA) builds an understanding of a data set: its components and how its variables are distributed. EDA can be graphical or quantitative. In this assignment, I used graphical EDA to get a better understanding of the data. EDA is also an important step in identifying the key variables that should go into a model, and it helps identify transformations the data may need before the model is built.
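The quantitative side of EDA was not used in this assignment. As a minimal sketch of what it could look like, the example below summarizes one numeric column and derives a reporting lag from the two year columns plotted earlier. The `toy` data frame and the `report_lag` variable are hypothetical, and the values are invented for illustration.

```r
# Hypothetical stand-in for ccrb; the values are invented for illustration only.
toy <- data.frame(
  inc_year     = c(2005, 2006, 2006, 2010, 2012, 2014),
  ReceivedYear = c(2005, 2006, 2007, 2010, 2012, 2015)
)

# Minimum, quartiles, median, mean and maximum of one numeric column.
print(summary(toy$ReceivedYear))

# Years between the incident and the complaint being received.
report_lag <- toy$ReceivedYear - toy$inc_year

# How often each lag occurs.
lag_counts <- table(report_lag)
print(lag_counts)
```

A derived variable such as this lag is also an example of a transformation that EDA can suggest before modelling.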