Objectives

The objective of this assignment is to conduct an exploratory data analysis of a data set that you are not familiar with. In this week's lecture we discussed a number of visualization approaches to exploring a data set; this assignment will apply those tools and techniques. An important distinction between class examples and applied data science work is the iterative and repetitive nature of exploring a data set. It takes time to understand what is in the data and what is interesting in it.

This week we will be exploring data from the NYC Data Transparency Initiative. They maintain a database of complaints that fall within the jurisdiction of the Civilian Complaint Review Board (CCRB), an independent municipal agency. Your objective is to identify interesting patterns within the data that may be indicative of large-scale trends.

This link will allow you to download the data set in .xlsx format. The data file has two tabs: one with metadata, and the “Complaints_Allegations” tab with the actual data.

Deliverable and Grades

For this assignment you should submit a link to a knitr-rendered HTML document that shows your exploratory data analysis. Organize your analysis using section headings:

R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

# Load the packages used in this analysis. Install any missing package once
# from the console, e.g. install.packages("dplyr"), rather than on every knit.
library(ggplot2)
library(ggthemes)
library(readxl)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
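# Read the second sheet of the workbook, which holds the complaint data.
complaint <- read_excel("/Users/ningyan/Downloads/problemset 4.xlsx", sheet = 2)
# data.frame() replaces spaces in column names with dots (e.g. Close.Year).
data <- data.frame(complaint)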
str(data)
## 'data.frame':    204397 obs. of  16 variables:
##  $ DateStamp                                  : POSIXct, format: "2016-11-29" "2016-11-29" ...
##  $ UniqueComplaintId                          : num  11 18 18 18 18 18 18 18 18 18 ...
##  $ Close.Year                                 : num  2006 2006 2006 2006 2006 ...
##  $ Received.Year                              : num  2005 2004 2004 2004 2004 ...
##  $ Borough.of.Occurrence                      : chr  "Manhattan" "Brooklyn" "Brooklyn" "Brooklyn" ...
##  $ Is.Full.Investigation                      : logi  FALSE TRUE TRUE TRUE TRUE TRUE ...
##  $ Complaint.Has.Video.Evidence               : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Complaint.Filed.Mode                       : chr  "On-line website" "Phone" "Phone" "Phone" ...
##  $ Complaint.Filed.Place                      : chr  "CCRB" "CCRB" "CCRB" "CCRB" ...
##  $ Complaint.Contains.Stop...Frisk.Allegations: logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Incident.Location                          : chr  "Street/highway" "Street/highway" "Street/highway" "Street/highway" ...
##  $ Incident.Year                              : num  2005 2004 2004 2004 2004 ...
##  $ Encounter.Outcome                          : chr  "No Arrest or Summons" "Arrest" "Arrest" "Arrest" ...
##  $ Reason.For.Initial.Contact                 : chr  "Other" "PD suspected C/V of violation/crime - street" "PD suspected C/V of violation/crime - street" "PD suspected C/V of violation/crime - street" ...
##  $ Allegation.FADO.Type                       : chr  "Abuse of Authority" "Abuse of Authority" "Discourtesy" "Discourtesy" ...
##  $ Allegation.Description                     : chr  "Threat of arrest" "Refusal to obtain medical treatment" "Word" "Word" ...
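
The str() output shows the same UniqueComplaintId on many consecutive rows, which suggests that each row is one allegation rather than one complaint. If that reading is right, bar heights in the plots that follow count allegations. The sketch below is only an illustration under that assumption: the `toy` data frame and its values are made up, and only the column names come from the output above. It counts rows and distinct complaints, then keeps one row per complaint.

```r
# Toy stand-in for the real data; the values are invented for illustration.
toy <- data.frame(
  UniqueComplaintId = c(11, 18, 18, 18, 25),
  Close.Year = c(2006, 2006, 2006, 2006, 2007),
  Allegation.FADO.Type = c("Abuse of Authority", "Abuse of Authority",
                           "Discourtesy", "Discourtesy", "Force")
)

n_rows <- nrow(toy)                                    # allegation-level rows
n_complaints <- length(unique(toy$UniqueComplaintId))  # distinct complaints

# Keep the first row seen for each complaint ID.
complaint_level <- toy[!duplicated(toy$UniqueComplaintId), ]

print(c(rows = n_rows, complaints = n_complaints))
print(complaint_level)
```

A complaint-level table like this is only appropriate for columns that are constant within a complaint (such as Close.Year); allegation columns such as Allegation.FADO.Type still need the full table.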

Visualization 1: Complaints close year analysis

This visualization shows how the number of closed complaints changes from year to year.

# Treat year as a discrete category so each year gets its own bar.
ggplot(complaint, aes(x = factor(`Close Year`))) +
  geom_bar() +
  labs(title = "Complaints by Close Year", x = "Close Year", y = "Number of Complaints")

Visualization 2: Complaints receive year analysis

This visualization shows how the number of received complaints changes from year to year.

# Treat year as a discrete category so each year gets its own bar.
ggplot(complaint, aes(x = factor(`Received Year`))) +
  geom_bar() +
  labs(title = "Complaints by Received Year", x = "Received Year", y = "Number of Complaints")
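
The received-year and close-year counts are easier to compare when they sit side by side in one table. The following is a minimal sketch on a made-up `toy` data frame (the values are invented; only the column names match the data), showing one way to tabulate both counts per year and their difference.

```r
# Toy stand-in for the real data; the values are invented for illustration.
toy <- data.frame(
  Received.Year = c(2004, 2004, 2005, 2005, 2005, 2006),
  Close.Year    = c(2005, 2005, 2005, 2006, 2006, 2006)
)

# Use one shared set of years so both tables line up.
years <- sort(unique(c(toy$Received.Year, toy$Close.Year)))

# factor() with shared levels makes table() report zero for empty years.
received <- table(factor(toy$Received.Year, levels = years))
closed   <- table(factor(toy$Close.Year, levels = years))

by_year <- data.frame(
  year     = years,
  received = as.vector(received),
  closed   = as.vector(closed)
)
# Positive values mean more rows were received than closed in that year.
by_year$net <- by_year$received - by_year$closed
print(by_year)
```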

Visualization 3: Borough of Occurrence vs Encounter Outcome Analysis

This visualization shows the encounter outcomes recorded in each borough.

# Bar height is the row count; fill splits each bar by encounter outcome.
ggplot(complaint, aes(x = `Borough of Occurrence`, fill = `Encounter Outcome`)) +
  geom_bar(width = 0.4) +
  labs(x = "Borough of Occurrence", y = "Number of Complaints", fill = "Encounter Outcome")
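
Because the boroughs differ in total volume, stacked counts make it hard to compare the mix of outcomes between them. As a rough sketch on a made-up `toy` data frame (values invented, including the "Summons" label, which is an assumption), row proportions can be computed with `prop.table()`:

```r
# Toy stand-in for the real data; the values are invented for illustration.
toy <- data.frame(
  Borough.of.Occurrence = c("Bronx", "Bronx", "Bronx", "Bronx",
                            "Queens", "Queens"),
  Encounter.Outcome = c("Arrest", "Arrest", "Summons", "No Arrest or Summons",
                        "Arrest", "No Arrest or Summons")
)

# Cross-tabulate borough (rows) against outcome (columns).
counts <- table(toy$Borough.of.Occurrence, toy$Encounter.Outcome)

# margin = 1 divides each row by its row total, so each borough sums to 1.
shares <- prop.table(counts, margin = 1)
print(round(shares, 2))
```

The plotting counterpart in ggplot2 is `geom_bar(position = "fill")`, which rescales every bar to the same height.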

Visualization 4: Incident Location vs Investigation Analysis

This visualization shows which incident locations have more full investigations.

# Bar height is the row count; fill splits each bar by investigation status.
ggplot(complaint, aes(x = `Incident Location`, fill = `Is Full Investigation`)) +
  geom_bar(width = 0.4) +
  labs(x = "Incident Location", y = "Number of Complaints", fill = "Full Investigation")

Visualization 5: Incident Location vs Received Year Analysis

This visualization shows the share of each incident location among the complaints received each year.

# position = "fill" rescales each year's bar to 1 so the segments show shares.
ggplot(complaint, aes(x = factor(`Received Year`), fill = `Incident Location`)) +
  geom_bar(position = "fill", width = 0.4) +
  labs(x = "Received Year", y = "Share of Complaints", fill = "Incident Location")

Visualization 6: Incident Location vs Video Evidence Analysis

This visualization shows which incident locations are more likely to have video evidence.

# One bar per video-evidence value, split by incident location.
ggplot(complaint, aes(x = `Complaint Has Video Evidence`, fill = `Incident Location`)) +
  geom_bar(width = 0.4) +
  labs(x = "Complaint Has Video Evidence", y = "Number of Complaints", fill = "Incident Location")
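
Judging which location is "more likely" to have video evidence needs a rate per location, not raw counts. One possible way to get it, sketched here on a made-up `toy` data frame (values invented; the second location label is an assumption), relies on the fact that the mean of a logical vector is the proportion of TRUE values:

```r
# Toy stand-in for the real data; the values are invented for illustration.
toy <- data.frame(
  Incident.Location = c("Street/highway", "Street/highway", "Street/highway",
                        "Street/highway", "Subway station/train",
                        "Subway station/train"),
  Complaint.Has.Video.Evidence = c(TRUE, FALSE, FALSE, FALSE, TRUE, TRUE)
)

# Proportion of rows with video evidence within each incident location.
video_rate <- tapply(toy$Complaint.Has.Video.Evidence,
                     toy$Incident.Location, mean)

# Highest rate first.
video_rate <- sort(video_rate, decreasing = TRUE)
print(video_rate)
```

The same pattern would apply to Is.Full.Investigation grouped by location or borough.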

Visualization 7: Complaint Geography Location vs Investigation Situation Analysis

This visualization shows how many complaints in each borough received a full investigation.

# Bar height is the row count; fill splits each bar by investigation status.
ggplot(complaint, aes(x = `Borough of Occurrence`, fill = `Is Full Investigation`)) +
  geom_bar(width = 0.4) +
  labs(x = "Borough of Occurrence", y = "Number of Complaints", fill = "Full Investigation")

Visualization 8: Complaint Mode Analysis

This visualization shows which mode of filing a complaint is most common.

# One bar per filing mode; bar height is the row count.
ggplot(complaint, aes(x = `Complaint Filed Mode`)) +
  geom_bar(width = 0.4) +
  labs(x = "Complaint Filed Mode", y = "Number of Complaints")
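
The most common category is easier to read off when the categories are sorted by frequency. Below is a small sketch on a made-up `toy` data frame (values invented; "In-person" is an assumed label) that sorts the counts and reorders the factor levels, so a later bar chart would draw its bars from most to least frequent.

```r
# Toy stand-in for the real data; the values are invented for illustration.
toy <- data.frame(
  Complaint.Filed.Mode = c("Phone", "Phone", "Phone", "On-line website",
                           "On-line website", "In-person")
)

# Count each mode and sort from most to least frequent.
mode_counts <- sort(table(toy$Complaint.Filed.Mode), decreasing = TRUE)
most_common <- names(mode_counts)[1]
print(mode_counts)

# Reorder the factor levels by frequency; plots follow the level order.
toy$Complaint.Filed.Mode <- factor(toy$Complaint.Filed.Mode,
                                   levels = names(mode_counts))
print(levels(toy$Complaint.Filed.Mode))
```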

Visualization 9: Complaint Place Analysis

This visualization shows which place of filing a complaint is most common.

# One bar per filing place; bar height is the row count.
ggplot(complaint, aes(x = `Complaint Filed Place`)) +
  geom_bar(width = 0.4) +
  labs(x = "Complaint Filed Place", y = "Number of Complaints")

Visualization 10: Encounter Outcome vs Full Investigation

This visualization shows whether the mix of encounter outcomes differs between complaints that received a full investigation and those that did not.

# One bar per investigation status, split by encounter outcome.
ggplot(complaint, aes(x = `Is Full Investigation`, fill = `Encounter Outcome`)) +
  geom_bar(width = 0.4) +
  labs(x = "Full Investigation", y = "Number of Complaints", fill = "Encounter Outcome")

To summarize, exploratory data analysis (EDA) helps reveal relationships between variables that are not obvious from the raw data. A plot can surface patterns we did not expect to see. Analysts can use whichever visualization tool best communicates the data to their audience.