Objectives

The objective of this assignment is to conduct an exploratory data analysis of a data set that you are not familiar with. In this week's lecture we discussed a number of visualization approaches to exploring a data set; this assignment applies those tools and techniques. An important distinction between class examples and applied data science work is the iterative and repetitive nature of exploring a data set. It takes time to understand what is in the data and what is interesting in it.

For this week we will be exploring data from the NYC Data Transparency Initiative. They maintain a database of complaints that fall within the jurisdiction of the Civilian Complaint Review Board (CCRB), an independent municipal agency. Your objective is to identify interesting patterns and trends within the data that may be indicative of large-scale trends.

This link will allow you to download the data set in .xlsx format. The data file has two tabs: one with metadata, and the “Complaints_Allegations” tab with the actual data.

Load data and packages

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.4.2
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.4.1
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggalt)
## Warning: package 'ggalt' was built under R version 3.4.2
library(readxl)
## Warning: package 'readxl' was built under R version 3.4.2
ccrb <- read_excel("C:/Users/az59/Desktop/512/ccrb_datatransparencyinitiative.xlsx", sheet = "Complaints_Allegations")
ccrb_df <- data.frame(ccrb)
str(ccrb_df)
## 'data.frame':    204397 obs. of  16 variables:
##  $ DateStamp                                  : POSIXct, format: "2016-11-29" "2016-11-29" ...
##  $ UniqueComplaintId                          : num  11 18 18 18 18 18 18 18 18 18 ...
##  $ Close.Year                                 : num  2006 2006 2006 2006 2006 ...
##  $ Received.Year                              : num  2005 2004 2004 2004 2004 ...
##  $ Borough.of.Occurrence                      : chr  "Manhattan" "Brooklyn" "Brooklyn" "Brooklyn" ...
##  $ Is.Full.Investigation                      : logi  FALSE TRUE TRUE TRUE TRUE TRUE ...
##  $ Complaint.Has.Video.Evidence               : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Complaint.Filed.Mode                       : chr  "On-line website" "Phone" "Phone" "Phone" ...
##  $ Complaint.Filed.Place                      : chr  "CCRB" "CCRB" "CCRB" "CCRB" ...
##  $ Complaint.Contains.Stop...Frisk.Allegations: logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Incident.Location                          : chr  "Street/highway" "Street/highway" "Street/highway" "Street/highway" ...
##  $ Incident.Year                              : num  2005 2004 2004 2004 2004 ...
##  $ Encounter.Outcome                          : chr  "No Arrest or Summons" "Arrest" "Arrest" "Arrest" ...
##  $ Reason.For.Initial.Contact                 : chr  "Other" "PD suspected C/V of violation/crime - street" "PD suspected C/V of violation/crime - street" "PD suspected C/V of violation/crime - street" ...
##  $ Allegation.FADO.Type                       : chr  "Abuse of Authority" "Abuse of Authority" "Discourtesy" "Discourtesy" ...
##  $ Allegation.Description                     : chr  "Threat of arrest" "Refusal to obtain medical treatment" "Word" "Word" ...

Vis 1 Complaints received in NYC

The following graph shows the complaints received between 1999 and 2016 in NYC, stacked by allegation (FADO) type. The tallest bar marks the year with the most complaints.

# Refer to columns by name inside aes() rather than with ccrb$...;
# geom_bar() counts rows per x value, so no stat = "count" override is needed
ggplot(ccrb, aes(x = `Received Year`, fill = `Allegation FADO Type`)) +
  geom_bar() +
  labs(title = "Complaints Received Each Year", x = "Complaints Received Year", y = "No. of Complaints") +
  scale_fill_discrete(name = "Allegation FADO Type") +
  theme(legend.position = "bottom")

Vis 2 Pie chart of Borough of Occurrence

The following pie chart shows the share of complaints by the borough in which the incident occurred.

Occ <- table(ccrb$'Borough of Occurrence')
pie(Occ)

Vis 3 Bar chart of Complaint Filed Mode

The following bar chart shows that phone is the most common way of filing a complaint.

Temp <- table(ccrb$'Complaint Filed Mode')
barplot(Temp, xlab='Complaint Filed Mode', col='light blue')
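
To check the reading of the chart numerically, the counts can be converted to shares. The sketch below uses a small made-up vector (`toy_mode`, a hypothetical name) in place of the real `Complaint Filed Mode` column, so the numbers are illustrative only. On the actual data one would pass `ccrb$'Complaint Filed Mode'` to `table()` instead.

```r
# Hypothetical toy values standing in for ccrb$'Complaint Filed Mode';
# the two category labels appear in the source, the counts are invented.
toy_mode <- c("Phone", "On-line website", "Phone", "Phone", "On-line website")

# Count each filing mode and sort from most to least frequent
mode_counts <- sort(table(toy_mode), decreasing = TRUE)

# Convert counts to proportions of all rows
mode_share <- round(prop.table(mode_counts), 2)

# The most frequent filing mode is the first name after sorting
top_mode <- names(mode_counts)[1]

print(mode_share)
print(top_mode)
```

Reporting the share alongside the bar chart makes the claim about the most common filing mode easier to verify than reading bar heights alone.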

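Vis 4 Bar chart of case resolution time

The following bar chart shows how many cases were closed in each year, stacked by the year in which the incident occurred. Segments from earlier incident years represent cases that took longer to close.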

Temp <- table(ccrb$"Incident Year", ccrb$"Close Year")
# Columns are close years; stacked segments are incident years
barplot(Temp, main = "Cases Closed Each Year by Incident Year", xlab = "Close Year", ylab = "Number of Cases",
        col = c("pink", "violet", "lightgreen", "grey", "yellow", "lightblue", "white", "darkolivegreen", "orange4",
                "darkorchid", "blue3", "darkgreen"),
        legend = rownames(Temp))
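
The stacked bars show resolution time only indirectly. A more direct measure is the difference between the close year and the received year. The sketch below is a minimal illustration on a hypothetical toy data frame (`toy`, with invented values and a made-up `years_to_close` column); the same subtraction could be applied to the real `ccrb` columns, keeping in mind that each row there is an allegation rather than a complaint.

```r
# Hypothetical toy data standing in for the ccrb tibble; the column names
# mirror the source, but the values are invented for illustration.
toy <- data.frame(
  `Received Year` = c(2004, 2004, 2005, 2005, 2006),
  `Close Year`    = c(2004, 2006, 2005, 2006, 2006),
  check.names = FALSE
)

# Years elapsed between receiving a complaint and closing it
toy$years_to_close <- toy$`Close Year` - toy$`Received Year`

# Frequency of each duration
duration_counts <- table(toy$years_to_close)
print(duration_counts)
```

A table or bar plot of these durations answers "how long does a case take" directly, without the reader having to compare colors across bars.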

Vis 5 Scatter plot with fitted line of incident year and close year

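# Incident year on the x-axis, close year on the y-axis (labels now match the data)
plot(ccrb$`Incident Year`, ccrb$`Close Year`, main = "Incident Year vs Close Year", xlab = "Incident Year", ylab = "Close Year", pch = 20)
# Regress close year on incident year so the fitted line matches the plotted axes
fit1 <- lm(`Close Year` ~ `Incident Year`, data = ccrb)
abline(fit1, col = "purple")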

Vis 6 Scatter plot with smoothed line of received year and incident year

ggplot(ccrb, aes(`Received Year`, `Incident Year`)) +
  geom_point() +
  geom_smooth() + labs(title = "Relationship between incident year and received year")
## `geom_smooth()` using method = 'gam'

Vis 7 Bar chart of complaints with video evidence by year

ccrb_d <- distinct(ccrb[1:14],  .keep_all = TRUE)
# Refer to columns by name inside aes(); geom_bar() counts rows per year
ggplot(ccrb_d, aes(x = `Incident Year`, fill = `Complaint Has Video Evidence`)) +
  geom_bar() +
  labs(title = "Number of Complaints with Video Evidence by Year", y = "No. of Complaints", x = "Incident Year") +
  scale_fill_discrete(name = "Complaint Has Video Evidence") +
  theme(legend.position = "bottom")
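
The deduplication step before this plot matters because each row of the data is an allegation, and one complaint can carry several allegations (in the `str()` output, complaint 18 appears repeatedly with different allegation types). Dropping the two allegation columns and keeping distinct rows leaves one row per complaint. The sketch below shows the same idea with base R's `unique()` on a hypothetical toy data frame (`toy`), whose rows are modeled on the first few rows of the `str()` output.

```r
# Hypothetical toy data modeled on the first rows shown by str(ccrb_df):
# complaint 18 carries three allegations, complaint 11 carries one.
toy <- data.frame(
  UniqueComplaintId = c(11, 18, 18, 18),
  Incident.Year = c(2005, 2004, 2004, 2004),
  Allegation.FADO.Type = c("Abuse of Authority", "Abuse of Authority",
                           "Discourtesy", "Discourtesy"),
  stringsAsFactors = FALSE
)

# Keep only complaint-level columns, then drop duplicate rows
complaints <- unique(toy[, c("UniqueComplaintId", "Incident.Year")])

# Allegation-level rows vs complaint-level rows
n_allegations <- nrow(toy)
n_complaints <- nrow(complaints)
print(c(allegations = n_allegations, complaints = n_complaints))
```

Counting on the deduplicated table avoids inflating a complaint-level chart by the number of allegations each complaint contains.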

Vis 8 Bar chart of Complaint Filed Mode with counts labeled on the bars

The following bar chart repeats the filing-mode comparison with the count printed above each bar; phone is again the most common mode.

Filed_Mode_group <- group_by(ccrb, `Complaint Filed Mode` )
Filed_Mode <- summarise(Filed_Mode_group, count= n())

ggplot(Filed_Mode, aes(x = `Complaint Filed Mode`, y = count, label = count)) +
  geom_bar(position = "dodge", stat = "identity", fill = "#FF9936") +
  # vjust = 1.2 places each label slightly above the top of its bar
  geom_text(size = 4, position = position_stack(vjust = 1.2)) +
  theme(axis.text.x = element_text(angle = 20, size = 10, vjust = 0.5)) +
  labs(title = "Complaint Filed Mode") +
  xlab("Complaint Filed Mode") + ylab("No. of Complaints")

Vis 9 Bar chart of incident locations

The following bar chart shows the number of complaints in each borough, stacked by the type of location where the incident occurred.

ccrb_d <- distinct(ccrb[1:14],  .keep_all = TRUE)
# Refer to columns by name inside aes(); geom_bar() counts rows per borough
ggplot(ccrb_d, aes(x = `Borough of Occurrence`, fill = `Incident Location`)) +
  geom_bar() +
  labs(title = "Incident Location", x = "Borough of Occurrence", y = "No. of Complaints") +
  scale_fill_discrete(name = "Incident Location") +
  theme(legend.position = "bottom")

Vis 10 Box plot comparing incident years across incident locations

pos1 <- ggplot(ccrb, aes(y=`Incident Year`,x=`Incident Location`))
pos1 + geom_boxplot() + coord_flip()

Summary

Exploratory data analysis (EDA) is very useful for understanding and communicating data. It gives readers a brief, clear idea of what the data says and what it can be used for. Instead of plain numbers, the data can be presented in colorful, more engaging charts. By exploring the relationships between variables, we may find patterns in people's complaints and the incidents, and then work out targeted solutions to reduce incidents and improve the government's work efficiency.