Objectives

The objective of this assignment is to conduct an exploratory data analysis of a data set that you are not familiar with. In this week's lecture we discussed a number of visualization approaches to exploring a data set; this assignment will apply those tools and techniques. An important distinction between class examples and applied data science work is the iterative and repetitive nature of exploring a data set. It takes time to understand what is in the data and what is interesting in the data.

For this week we will be exploring data from the NYC Data Transparency Initiative. They maintain a database of complaints that fall within the Civilian Complaint Review Board (CCRB), an independent municipal agency. Your objective is to identify interesting patterns and trends within the data that may be indicative of large-scale trends.

This link will allow you to download the data set in .xlsx format. The data file has two tabs: one with metadata, and the “Complaints_Allegations” tab with the actual data.

Deliverable and Grades

For this assignment you should submit an RPubs link to a knitr-rendered HTML document that shows your exploratory data analysis. Organize your analysis using meaningful section headings:

# This is a top section

## This is a subsection

Your final document should include at least 10 visualizations. Each should include a brief statement of what it shows.

A final section should summarize what you learned from your EDA. Your grade will be based on the quality of your graphics and the sophistication of your findings.

# Exploratory data analysis (EDA) helps in understanding the patterns in a data set and the relationships between its variables, which makes it easier to draw inferences. EDA is an approach rather than a fixed set of techniques: an attitude or philosophy about how a data analysis should be carried out.

# Load the packages used for this data visualization analysis
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.3.3
library(ggthemes)
## Warning: package 'ggthemes' was built under R version 3.3.3
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.3.3
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# read data
data <- read.csv('C:/Users/SurbhiK/Downloads/ccrb.csv',header = TRUE)


# 1. Borough of Occurrence of Incidents by Year
ggplot(data, aes(x = Received.Year, fill = Borough.of.Occurrence)) + geom_bar() + labs(title = "Place of Occurrence by Year", x = "Year", y = "Number of Complaints") + theme(legend.position = "left") + scale_fill_discrete(name = "Borough of Occurrence")

## The bar chart above shows the complaints received between 2000 and 2016, split by borough of occurrence. The total number of complaints has been declining since 2007. One possible reason for this decline across all boroughs is that the police are doing a better job.
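The declining trend read off the chart can also be checked numerically. The following is a minimal sketch in base R; the `toy` data frame and its values are invented for illustration and only borrow the `Received.Year` column name from the real data.

```r
# Toy stand-in for the CCRB data; these values are invented, not real counts.
toy <- data.frame(
  Received.Year = c(2006, 2006, 2006, 2007, 2007, 2007, 2007, 2008, 2008, 2009)
)

# Count rows per year; table() orders the years ascending.
per_year <- as.data.frame(table(Received.Year = toy$Received.Year),
                          stringsAsFactors = FALSE)
names(per_year)[2] <- "n"
per_year$Received.Year <- as.integer(per_year$Received.Year)

# Year-over-year change in the count; the first year has no predecessor.
per_year$change <- c(NA, diff(per_year$n))

# The year with the largest count.
peak_year <- per_year$Received.Year[which.max(per_year$n)]

print(per_year)
print(peak_year)
```

Negative values in `change` after `peak_year` would support the reading of a decline; on the real data the same steps would be applied to `data$Received.Year`.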

# 2. Frequency of Complaint Type by Borough
ggplot(data, aes(x = Allegation.FADO.Type, fill = Borough.of.Occurrence)) + geom_bar() + labs(title = "Frequency of Complaint Type by Borough", x = "Complaint Type", y = "Frequency of Complaints") + theme(legend.position = "left") + scale_fill_discrete(name = "Borough")

## The most frequent type of complaint is 'Abuse of Authority', and it occurred most often in Brooklyn. 'Offensive Language' is the least reported type, and its totals are similar across all the boroughs. This may be because few people file complaints about offensive language.

# 3. Full Investigation by Borough
ggplot(data, aes(x = Is.Full.Investigation, fill = Borough.of.Occurrence)) + geom_bar() + labs(title = "Investigation by Borough", x = "Is Full Investigation", y = "Number") + theme(legend.position = "left") + scale_fill_discrete(name = "Borough")

## The chart above shows that, irrespective of the borough, the majority of complaints do not receive a full investigation. Overall, the police should do a better job of completing investigations rather than just registering complaints.
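Because the boroughs differ in total complaint volume, stacked counts make it hard to compare how often a full investigation happens in each one. A minimal sketch of computing within-borough shares with `prop.table()` follows. The `toy` data is invented, and treating `Is.Full.Investigation` as a logical column is an assumption about how the real file is encoded.

```r
# Toy stand-in for the CCRB data; values are invented for illustration.
# Assumption: Is.Full.Investigation is stored as TRUE/FALSE.
toy <- data.frame(
  Borough.of.Occurrence = c("Bronx", "Bronx", "Bronx", "Bronx",
                            "Brooklyn", "Brooklyn", "Brooklyn", "Brooklyn", "Brooklyn",
                            "Queens", "Queens"),
  Is.Full.Investigation = c(TRUE, FALSE, FALSE, FALSE,
                            TRUE, TRUE, FALSE, FALSE, FALSE,
                            TRUE, FALSE)
)

# Cross-tabulate borough (rows) against investigation status (columns).
counts <- table(toy$Borough.of.Occurrence, toy$Is.Full.Investigation)

# margin = 1 normalizes each row, so the shares within a borough sum to 1.
shares <- prop.table(counts, margin = 1)

# Share of complaints that received a full investigation, per borough.
full_share <- shares[, "TRUE"]
print(round(full_share, 2))
```

In ggplot2, passing `position = "fill"` to `geom_bar()` draws the same kind of shares directly, provided the borough is mapped to `x` and the investigation status to `fill`.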

# 4. Number of Incidents Each Year by Presence of Video Evidence
ggplot(data, aes(x = Incident.Year, fill = Complaint.Has.Video.Evidence)) + geom_bar() + labs(title = "Number of Incidents Each Year by Video Evidence", x = "Incident Year", y = "Number") + theme(legend.position = "left") + scale_fill_discrete(name = "Has Video Evidence")

## Although only a very small percentage of complaints have video evidence, the number of cases with video evidence has been increasing. I think this is because of the growing use of smartphones.
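Since the total number of complaints changes from year to year, the growth in video evidence is easier to judge as a share of each year's complaints than as a raw count. A minimal sketch with `tapply()` follows. The `toy` values are invented, and the logical encoding of `Complaint.Has.Video.Evidence` is an assumption.

```r
# Toy stand-in for the CCRB data; values are invented for illustration.
# Assumption: Complaint.Has.Video.Evidence is stored as TRUE/FALSE.
toy <- data.frame(
  Incident.Year = c(2012, 2012, 2012, 2012, 2013, 2013, 2013, 2013, 2014, 2014),
  Complaint.Has.Video.Evidence = c(FALSE, FALSE, FALSE, FALSE,
                                   TRUE, FALSE, FALSE, FALSE,
                                   TRUE, FALSE)
)

# The mean of a logical vector is the proportion of TRUE values,
# so this gives the share of complaints with video evidence per year.
video_share <- tapply(toy$Complaint.Has.Video.Evidence, toy$Incident.Year, mean)

# TRUE when the share rises in every consecutive year.
rising <- all(diff(as.vector(video_share)) > 0)

print(video_share)
print(rising)
```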

# 5. Number of Cases Closed Each Year by Investigation
ggplot(data, aes(x = Close.Year, fill = Is.Full.Investigation)) + geom_bar() + labs(title = "Number of Cases Closed Each Year by Investigation", x = "Close Year", y = "Number") + theme(legend.position = "left") + scale_fill_discrete(name = "Is Full Investigation")

## Regardless of the year, about half of the cases turn out to be unsolved. The police should try to improve this result and solve more cases.

# 6. Number of Incidents Each Year by Outcome
ggplot(data, aes(x = Incident.Year, fill = Encounter.Outcome)) + geom_bar() + labs(title = "Number of Incidents Each Year by Outcome", x = "Incident Year", y = "Number") + theme(legend.position = "bottom") + scale_fill_discrete(name = "Encounter Outcome")

## The chart above shows a large number of cases go without any arrest or summons.

# 7. Encounter Outcome Pie
E <- table(data$Encounter.Outcome)
pie(E)

## As in the previous chart, the pie chart shows that 'Arrest' and 'No Arrest or Summons' account for roughly the same share of outcomes.

# 8. Relationship between Incident Year and Received Year
ggplot(data, aes(x = Incident.Year, y = Received.Year)) + geom_point() + geom_smooth(method = "lm") + labs(title = "Relationship between Incident Year and Received Year", x = "Incident Year", y = "Received Year")

## The scatter plot shows a positive relation between the incident year and complaint received year.

# 9. Relationship between Close Year and Received Year
ggplot(data, aes(x = Received.Year, y = Close.Year)) + geom_point() + geom_smooth(method = "lm") + labs(title = "Relationship between Close Year and Received Year", x = "Received Year", y = "Close Year")

## Overall, the scatter plot shows a positive relationship between the year a complaint was received and the year it was closed. There are many outliers for the years 2010 and 2011.
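Both year columns take only a handful of whole-number values, so many complaints land on the same point of the scatter plot and a single dot can stand for one case or for thousands. A minimal sketch follows that counts the cases behind each point and tabulates the lag between receipt and closure. The `toy` values are invented, and `years_to_close` and `pair_counts` are names chosen here for illustration.

```r
# Toy stand-in for the CCRB data; values are invented for illustration.
toy <- data.frame(
  Received.Year = c(2009, 2009, 2009, 2010, 2010, 2010, 2011, 2011),
  Close.Year    = c(2009, 2009, 2010, 2010, 2010, 2013, 2011, 2012)
)

# Lag in whole years between receiving and closing a complaint.
toy$years_to_close <- toy$Close.Year - toy$Received.Year

# How many complaints fall at each lag; long lags are the outliers.
lag_counts <- table(toy$years_to_close)

# Number of complaints hidden behind each (received, closed) point.
pair_counts <- aggregate(n ~ Received.Year + Close.Year,
                         data = transform(toy, n = 1), FUN = sum)

print(lag_counts)
print(pair_counts)
```

In ggplot2, `geom_count()` sizes each point by the number of observations at that location, which is one way to show the same information in the plot itself.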

# 10. Number of Complaints by Allegation Type
ggplot(data, aes(x = Allegation.FADO.Type, fill = Allegation.FADO.Type)) + geom_bar() + labs(title = "Number of Complaints by Allegation Type", x = "Type", y = "Number") + theme(legend.position = "bottom") + scale_fill_discrete(name = "Type")

## The chart above shows that 'Abuse of Authority' is the most frequent allegation type, followed by 'Force' and then 'Discourtesy'. 'Offensive Language' is the least filed allegation, possibly because few people report it.