Objectives

The objective of this assignment is to conduct an exploratory data analysis of a data set that you are not familiar with. In this week's lecture we discussed a number of visualization approaches to exploring a data set; this assignment will apply those tools and techniques. An important distinction between class examples and applied data science work is the iterative and repetitive nature of exploring a data set. It takes time to understand what is in the data and what is interesting in the data.

This week we will be exploring data from the NYC Data Transparency Initiative. They maintain a database of complaints that fall within the jurisdiction of the Civilian Complaint Review Board (CCRB), an independent municipal agency. Your objective is to identify interesting patterns within the data that may be indicative of large-scale trends.

This link will allow you to download the data set in .xlsx format. The data file has two tabs: one with metadata, and the “Complaints_Allegations” tab with the actual data.

  1. Bar plot showing number of occurrences by borough
library(ggplot2)
library(readxl)
# Read the data tab of the workbook (the other tab holds metadata)
ccrb <- read_excel("~/Downloads/ccrb.xlsx", sheet = "Complaints_Allegations")
# Count rows per borough and plot the counts
barplot(table(ccrb$Borough_of_Occurrence), xlab = "Borough of Occurrence", ylab = "Occurrences", col = "grey")

  2. Proportion of full investigations
pie(table(ccrb$Is_Full_Investigation))

Most occurrences have not been fully investigated.
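A pie chart makes it hard to read exact shares, so the observation above can be backed with numbers. The following is a minimal sketch on a small made-up data frame (`toy` is hypothetical and only stands in for `ccrb`); it assumes `Is_Full_Investigation` is read in as a logical column.

```r
# Hypothetical stand-in for ccrb; the real data comes from read_excel() above.
toy <- data.frame(Is_Full_Investigation = c(TRUE, FALSE, FALSE, TRUE, FALSE))

# Row counts per category, then the same counts as shares of the total.
counts <- table(toy$Is_Full_Investigation)
shares <- prop.table(counts)

print(counts)
print(round(shares, 2))
```

Replacing `toy` with `ccrb` should give the actual shares behind the pie chart.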

  3. Number of cases closed by year
# Refer to columns by bare name inside aes(); geom_bar() counts rows per year
ggplot(ccrb, aes(x = CloseYear, fill = Is_Full_Investigation)) +
  geom_bar() +
  labs(title = "Number of Cases Closed", x = "Year", y = "Cases closed") +
  theme(legend.position = "bottom") +
  scale_fill_discrete(name = "Fully investigated") +
  scale_x_continuous(breaks = seq(1999, 2016, 1))

  4. Number of incidents with video evidence
# Stacked counts per incident year, split by whether video evidence exists
ggplot(ccrb, aes(x = inc_year, fill = cmpt_vd_evid)) +
  geom_bar() +
  labs(title = "Number of Incidents with Video Evidence", x = "Year", y = "Number of Incidents") +
  theme(legend.position = "bottom") +
  scale_fill_discrete(name = "Has Video Evidence") +
  scale_x_continuous(breaks = seq(1999, 2016, 1))

Only a small portion of incidents each year had video evidence. In fact, prior to 2010, there was no video evidence at all.
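The claim about the first year with video evidence is easier to confirm from a table than from the stacked bars. Below is a rough sketch using a made-up data frame (`toy` and its values are hypothetical); it assumes `cmpt_vd_evid` is a logical column, as the plot legend suggests.

```r
# Hypothetical stand-in for ccrb with only the two columns of interest.
toy <- data.frame(
  inc_year     = c(2008, 2009, 2010, 2010, 2011, 2011),
  cmpt_vd_evid = c(FALSE, FALSE, FALSE, TRUE, TRUE, FALSE)
)

# Cross-tabulate incident year against the video-evidence flag.
by_year <- table(toy$inc_year, toy$cmpt_vd_evid)
print(by_year)

# Earliest incident year that has at least one row with video evidence.
first_video_year <- min(toy$inc_year[toy$cmpt_vd_evid], na.rm = TRUE)
print(first_video_year)
```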

  5. Frequency of incidents by borough and type

# Stacked counts per borough, split by allegation type
ggplot(ccrb, aes(x = Borough_of_Occurrence, fill = alg_type)) +
  geom_bar() +
  labs(title = "Frequency of Incidents by Borough and Type", x = "Borough", y = "Frequency of Occurrence") +
  scale_fill_discrete(name = "Allegation Type") +
  theme(legend.position = "bottom")

Abuse of Authority has been the most common allegation type over the years.
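Because the boroughs differ a lot in total counts, row shares can make the comparison of allegation types fairer than raw stacked bars. This is a sketch on made-up rows (`toy` and its contents are hypothetical, not taken from the CCRB file).

```r
# Hypothetical stand-in for ccrb; the values are illustrative only.
toy <- data.frame(
  Borough_of_Occurrence = c("Bronx", "Bronx", "Bronx", "Queens", "Queens"),
  alg_type = c("Abuse of Authority", "Force", "Abuse of Authority",
               "Abuse of Authority", "Discourtesy")
)

# Counts of allegation type within each borough.
type_by_borough <- table(toy$Borough_of_Occurrence, toy$alg_type)

# margin = 1 normalises each row, so every borough sums to 1.
row_shares <- prop.table(type_by_borough, margin = 1)
print(round(row_shares, 2))

# Most frequent allegation type in each borough.
top_type <- colnames(type_by_borough)[apply(type_by_borough, 1, which.max)]
names(top_type) <- rownames(type_by_borough)
print(top_type)
```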

  6. Histogram of complaints received by year

hist(ccrb$ReceivedYear, 
     main="Histogram for Complaints Received by Year", xlab="Received Year", col="grey")

  7. Comparison of time between incident and when the case was received
# Scatter of incident year against received year with a linear fit
ggplot(ccrb, aes(x = inc_year, y = ReceivedYear)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(title = "Incident Year vs. Received Year", x = "Incident Year", y = "Received Year") +
  theme(panel.grid.minor = element_line(colour = "white"))
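
Since both axes are whole years, many points in the scatter plot overlap. One alternative way to look at the same question is to tabulate the gap between the two years directly. This is only a sketch on made-up values (`toy` and `lag_years` are hypothetical names).

```r
# Hypothetical stand-in for ccrb with the two year columns.
toy <- data.frame(
  inc_year     = c(2005, 2005, 2006, 2010),
  ReceivedYear = c(2005, 2006, 2006, 2010)
)

# Gap in calendar years between the incident and receipt of the complaint.
lag_years <- toy$ReceivedYear - toy$inc_year
print(table(lag_years))

# Share of rows received in the same calendar year as the incident.
same_year_share <- mean(lag_years == 0, na.rm = TRUE)
print(same_year_share)
```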

  8. Encounter outcomes by incident year
# colour is set outside aes() so the boxes are actually drawn in red
ggplot(ccrb, aes(x = enc_outcome, y = inc_year)) +
  geom_boxplot(colour = "red") +
  labs(x = "Encounter Outcome", y = "Incident Year")

  9. Incident location by year using a boxplot
# labs() follows the aesthetics, not the flipped axes: x is still inc_loc
ggplot(ccrb, aes(x = inc_loc, y = inc_year)) +
  geom_boxplot() +
  coord_flip() +
  labs(title = "Incident Location by Year", x = "Incident Location", y = "Year of Incident")

  10. Types of allegations with density
# One density curve of received year per allegation type
ggplot(ccrb, aes(x = ReceivedYear, colour = alg_type)) +
  geom_density() +
  theme_classic() +
  labs(title = "Density of Received Year by Allegation Type", x = "Received Year", y = "Density")

Exploratory data analysis (EDA) helps build a good understanding of the data, including its components and the distribution of its variables. EDA can be of two types: graphical and quantitative. In this assignment, I used graphical EDA to get a better understanding of the data. Moreover, EDA is an important step in identifying the key variables that would go into a model. It also helps identify transformations the data may need before the model is built.
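
For contrast with the graphical approach used here, a quantitative pass could start with summary statistics and missing-value counts, which often point to the transformations needed before modelling. The sketch below uses a made-up data frame (`toy` is hypothetical), not the CCRB file.

```r
# Hypothetical stand-in for ccrb with one numeric and one text column.
toy <- data.frame(
  ReceivedYear = c(2005, 2006, NA, 2010),
  alg_type     = c("Force", NA, "Discourtesy", "Force"),
  stringsAsFactors = FALSE
)

# Five-number summary (plus mean and NA count) for a numeric column.
print(summary(toy$ReceivedYear))

# Number of missing values in each column.
na_counts <- colSums(is.na(toy))
print(na_counts)
```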