The objective of this assignment is to conduct an exploratory data analysis of a data set that you are not familiar with. In this week's lecture we discussed a number of visualization approaches to exploring a data set; this assignment will apply those tools and techniques. An important distinction between class examples and applied data science work is the iterative and repetitive nature of exploring a data set. It takes time to understand what is in the data and what is interesting in the data.
For this week we will be exploring data from the NYC Data Transparency Initiative. They maintain a database of complaints that fall within the jurisdiction of the Civilian Complaint Review Board (CCRB), an independent municipal agency. Your objective is to identify interesting patterns and trends within the data that may be indicative of large-scale trends.
This link will allow you to download the data set in .xlsx format. The data file has two tabs: one with metadata, and the “Complaints_Allegations” tab with the actual data.
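As a minimal sketch of loading the data tab straight from the workbook, `read_excel()` from readxl accepts a `sheet` argument. The file name `ccrb.xlsx` and the helper `read_ccrb_sheet()` are assumptions made for illustration; only the tab name comes from the description above.

```r
library(readxl)

# Hypothetical local file name; replace with wherever the download was saved.
xlsx_path <- "ccrb.xlsx"

# Illustrative helper (not part of the assignment): read one tab by name
# and return a plain data frame.
read_ccrb_sheet <- function(path, sheet = "Complaints_Allegations") {
  as.data.frame(read_excel(path, sheet = sheet))
}

# Only touch the file if it is actually present.
if (file.exists(xlsx_path)) {
  print(excel_sheets(xlsx_path))  # list the tab names in the workbook
  ccrb_xlsx <- read_ccrb_sheet(xlsx_path)
  str(ccrb_xlsx)
}
```

Note that `read_excel()` does not convert column names into syntactic R names the way `read.csv()` does by default, so names read this way may contain spaces rather than dots.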
For this assignment you should submit a link to a knitr-rendered HTML document that shows your exploratory data analysis. Organize your analysis using section headings:
# This is a top section
## This is a subsection
library(readr)
library(readxl)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(ggthemes)
library(stringr)
getwd()
setwd("~/Desktop/HU/ANLY512/R")
ccrb <- read.csv("ccrb.csv", stringsAsFactors = TRUE) # read text columns as factors (no longer the default from R 4.0 on)
str(ccrb)
## 'data.frame': 204397 obs. of 16 variables:
## $ DateStamp : Factor w/ 1 level "11/29/2016": 1 1 1 1 1 1 1 1 1 1 ...
## $ UniqueComplaintId : int 11 18 18 18 18 18 18 18 18 18 ...
## $ Close.Year : int 2006 2006 2006 2006 2006 2006 2006 2006 2006 2006 ...
## $ Received.Year : int 2005 2004 2004 2004 2004 2004 2004 2004 2004 2004 ...
## $ Borough.of.Occurrence : Factor w/ 6 levels "Bronx","Brooklyn",..: 3 2 2 2 2 2 2 2 2 2 ...
## $ Is.Full.Investigation : logi FALSE TRUE TRUE TRUE TRUE TRUE ...
## $ Complaint.Has.Video.Evidence : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ Complaint.Filed.Mode : Factor w/ 7 levels "Call Processing System",..: 6 7 7 7 7 7 7 7 7 7 ...
## $ Complaint.Filed.Place : Factor w/ 14 levels "CCRB","Comm. to Combat Police Corruption",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Complaint.Contains.Stop...Frisk.Allegations: logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ Incident.Location : Factor w/ 15 levels "Apartment/house",..: 14 14 14 14 14 14 14 14 14 14 ...
## $ Incident.Year : int 2005 2004 2004 2004 2004 2004 2004 2004 2004 2004 ...
## $ Encounter.Outcome : Factor w/ 4 levels "Arrest","No Arrest or Summons",..: 2 1 1 1 1 1 1 1 1 1 ...
## $ Reason.For.Initial.Contact : Factor w/ 49 levels "Aided case","Arrest/Complainant",..: 23 32 32 32 32 32 32 32 32 32 ...
## $ Allegation.FADO.Type : Factor w/ 4 levels "Abuse of Authority",..: 1 1 2 2 2 3 3 3 3 3 ...
## $ Allegation.Description : Factor w/ 56 levels "Action","Animal",..: 48 35 56 56 56 27 27 27 27 27 ...
nrow(ccrb)
## [1] 204397
ncol(ccrb)
## [1] 16
head(ccrb, 5) # Look at the top and bottom of data
## DateStamp UniqueComplaintId Close.Year Received.Year
## 1 11/29/2016 11 2006 2005
## 2 11/29/2016 18 2006 2004
## 3 11/29/2016 18 2006 2004
## 4 11/29/2016 18 2006 2004
## 5 11/29/2016 18 2006 2004
## Borough.of.Occurrence Is.Full.Investigation Complaint.Has.Video.Evidence
## 1 Manhattan FALSE FALSE
## 2 Brooklyn TRUE FALSE
## 3 Brooklyn TRUE FALSE
## 4 Brooklyn TRUE FALSE
## 5 Brooklyn TRUE FALSE
## Complaint.Filed.Mode Complaint.Filed.Place
## 1 On-line website CCRB
## 2 Phone CCRB
## 3 Phone CCRB
## 4 Phone CCRB
## 5 Phone CCRB
## Complaint.Contains.Stop...Frisk.Allegations Incident.Location
## 1 FALSE Street/highway
## 2 FALSE Street/highway
## 3 FALSE Street/highway
## 4 FALSE Street/highway
## 5 FALSE Street/highway
## Incident.Year Encounter.Outcome
## 1 2005 No Arrest or Summons
## 2 2004 Arrest
## 3 2004 Arrest
## 4 2004 Arrest
## 5 2004 Arrest
## Reason.For.Initial.Contact Allegation.FADO.Type
## 1 Other Abuse of Authority
## 2 PD suspected C/V of violation/crime - street Abuse of Authority
## 3 PD suspected C/V of violation/crime - street Discourtesy
## 4 PD suspected C/V of violation/crime - street Discourtesy
## 5 PD suspected C/V of violation/crime - street Discourtesy
## Allegation.Description
## 1 Threat of arrest
## 2 Refusal to obtain medical treatment
## 3 Word
## 4 Word
## 5 Word
tail(ccrb, 5)
## DateStamp UniqueComplaintId Close.Year Received.Year
## 204393 11/29/2016 69476 2016 2016
## 204394 11/29/2016 69476 2016 2016
## 204395 11/29/2016 69476 2016 2016
## 204396 11/29/2016 69476 2016 2016
## 204397 11/29/2016 69476 2016 2016
## Borough.of.Occurrence Is.Full.Investigation
## 204393 Brooklyn TRUE
## 204394 Brooklyn TRUE
## 204395 Brooklyn TRUE
## 204396 Brooklyn TRUE
## 204397 Brooklyn TRUE
## Complaint.Has.Video.Evidence Complaint.Filed.Mode
## 204393 FALSE On-line website
## 204394 FALSE On-line website
## 204395 FALSE On-line website
## 204396 FALSE On-line website
## 204397 FALSE On-line website
## Complaint.Filed.Place Complaint.Contains.Stop...Frisk.Allegations
## 204393 CCRB FALSE
## 204394 CCRB FALSE
## 204395 CCRB FALSE
## 204396 CCRB FALSE
## 204397 CCRB FALSE
## Incident.Location Incident.Year Encounter.Outcome
## 204393 Apartment/house 2016 Arrest
## 204394 Apartment/house 2016 Arrest
## 204395 Apartment/house 2016 Arrest
## 204396 Apartment/house 2016 Arrest
## 204397 Apartment/house 2016 Arrest
## Reason.For.Initial.Contact Allegation.FADO.Type
## 204393 Execution of search warrant Discourtesy
## 204394 Execution of search warrant Discourtesy
## 204395 Execution of search warrant Offensive Language
## 204396 Execution of search warrant Offensive Language
## 204397 Execution of search warrant Offensive Language
## Allegation.Description
## 204393 Word
## 204394 Word
## 204395 Gender
## 204396 Gender
## 204397 Gender
names(ccrb)
## [1] "DateStamp"
## [2] "UniqueComplaintId"
## [3] "Close.Year"
## [4] "Received.Year"
## [5] "Borough.of.Occurrence"
## [6] "Is.Full.Investigation"
## [7] "Complaint.Has.Video.Evidence"
## [8] "Complaint.Filed.Mode"
## [9] "Complaint.Filed.Place"
## [10] "Complaint.Contains.Stop...Frisk.Allegations"
## [11] "Incident.Location"
## [12] "Incident.Year"
## [13] "Encounter.Outcome"
## [14] "Reason.For.Initial.Contact"
## [15] "Allegation.FADO.Type"
## [16] "Allegation.Description"
# Create a new variable: the number of years a complaint took to close
ccrb <- ccrb %>%
  mutate(num.yrs = Close.Year - Received.Year)
ggplot(ccrb, aes(Allegation.FADO.Type, num.yrs)) +
  geom_boxplot() +
  ggtitle("Boxplot of years taken to close a complaint,\nby allegation type")
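The boxplot can be paired with a numeric summary of the same quantity. The sketch below uses a small toy data frame with made-up values (not rows from the CCRB file) and the same column names, so that it runs on its own; `duration_summary` is an illustrative name.

```r
library(dplyr)

# Toy data with made-up values, for illustration only.
toy <- data.frame(
  Allegation.FADO.Type = c("Force", "Force", "Discourtesy", "Discourtesy", "Discourtesy"),
  Received.Year        = c(2004, 2005, 2004, 2006, 2006),
  Close.Year           = c(2006, 2005, 2005, 2006, 2008)
)

# Years from receipt to closure, summarized per allegation type.
duration_summary <- toy %>%
  mutate(num.yrs = Close.Year - Received.Year) %>%
  group_by(Allegation.FADO.Type) %>%
  summarise(
    n      = n(),
    median = median(num.yrs),
    q1     = unname(quantile(num.yrs, 0.25)),
    q3     = unname(quantile(num.yrs, 0.75)),
    max    = max(num.yrs)
  )

print(duration_summary)
```

Replacing `toy` with `ccrb` would give the figures behind each box in the plot, assuming the columns are named as in the `str()` output.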
ggplot(ccrb, aes(y = Incident.Year, x = Allegation.FADO.Type)) +
  geom_boxplot() +
  ggtitle("Boxplot showing distribution of allegation types over the years")
ggplot(ccrb, aes(x = Borough.of.Occurrence, y = Incident.Year)) +
  geom_boxplot() +
  labs(title = 'Boxplot showing distribution of borough of occurrence over the years')
summary(ccrb$Borough.of.Occurrence) # Frequency table: number of rows per borough
## Bronx Brooklyn Manhattan Outside NYC Queens
## 49442 72215 42104 170 30883
## Staten Island NA's
## 9100 483
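To put those counts on a common scale, they can be converted to percentages. This sketch copies the counts from the `summary()` output above into a named vector so it runs without the data file; the names `borough_counts` and `borough_share` are illustrative.

```r
# Counts copied from the summary() output above, including the missing values.
borough_counts <- c(
  "Bronx"         = 49442,
  "Brooklyn"      = 72215,
  "Manhattan"     = 42104,
  "Outside NYC"   = 170,
  "Queens"        = 30883,
  "Staten Island" = 9100,
  "NA"            = 483
)

# Share of all rows in each borough, in percent, largest first.
borough_share <- 100 * borough_counts / sum(borough_counts)
print(round(sort(borough_share, decreasing = TRUE), 1))
```

On the full data frame, `prop.table(table(ccrb$Borough.of.Occurrence, useNA = "ifany"))` should give the same proportions directly.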
ggplot(ccrb, aes(Borough.of.Occurrence)) +
  geom_bar(color = "white", fill = "tomato3") +
  ggtitle("Barplot showing number of complaints in each borough") +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 10)) + # wrap long axis labels
  scale_y_continuous(breaks = seq(0, 73000, 5000))
ggplot(ccrb, aes(x = Complaint.Filed.Mode)) +
  geom_bar(stat = 'count', color = "white", fill = "tomato3") +
  labs(title = 'Barplot showing number of complaints by filing mode') +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 10)) # wrap long axis labels
ggplot(ccrb, aes(Incident.Location)) +
  geom_bar(color = "white", fill = "tomato3") +
  coord_flip() + # horizontal bars keep the long location names readable
  ggtitle("Barplot showing number of incidents by location")
ggplot(ccrb, aes(x = Is.Full.Investigation, fill = Complaint.Has.Video.Evidence)) +
  geom_bar(stat = 'count') +
  labs(title = 'Stacked-bar plot showing joint distribution of full investigation\nand video evidence on the complaints') +
  scale_fill_discrete(name = 'Complaint Has Video Evidence')
ggplot(ccrb, aes(x = Encounter.Outcome, fill = Is.Full.Investigation)) +
  geom_bar(stat = 'count') +
  labs(title = 'Stacked-bar plot showing joint distribution of Encounter Outcome and Full Investigation') +
  scale_fill_discrete(name = 'Full Investigation')
ggplot(ccrb, aes(x = Borough.of.Occurrence, fill = Allegation.FADO.Type)) +
  geom_bar() + # count rows per borough, stacked by allegation type
  labs(title = "Stacked-bar plot showing distribution of allegation type\nin each borough of occurrence") +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 10))
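Because the boroughs differ greatly in total count, a stacked bar of raw counts makes it hard to compare the mix of allegation types between them. One possible variation, sketched here on a toy data frame with made-up rows, rescales each bar to proportions with `position = "fill"`. The object name `p_fill` and the axis labels are illustrative choices.

```r
library(ggplot2)

# Toy data with made-up rows, for illustration only.
toy <- data.frame(
  Borough.of.Occurrence = c("Bronx", "Bronx", "Brooklyn", "Brooklyn", "Brooklyn", "Queens"),
  Allegation.FADO.Type  = c("Force", "Discourtesy", "Force", "Force",
                            "Abuse of Authority", "Discourtesy")
)

# position = "fill" stacks the counts and rescales every bar to a height of 1.
p_fill <- ggplot(toy, aes(x = Borough.of.Occurrence, fill = Allegation.FADO.Type)) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = function(x) paste0(100 * x, "%")) +
  labs(y = "Share of allegations", fill = "Allegation type")

print(p_fill)
```

Swapping `toy` for `ccrb` should apply the same view to the real data. The trade-off is that the proportional view hides how many rows each borough contributes, so it complements the count plot rather than replacing it.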
# Count distinct complaints (not allegation rows) received in each year
ccrb1 <- ccrb %>%
  group_by(Received.Year) %>%
  summarize(total = n_distinct(UniqueComplaintId))
ggplot(ccrb1, aes(x = Received.Year, y = total)) +
  geom_line() +
  ggtitle('Line plot showing trend of filed complaints over years')
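The yearly totals above use `n_distinct(UniqueComplaintId)` rather than a plain row count because, judging from the `head()` output, one complaint ID can span several allegation rows. A minimal sketch of the difference, on a toy data frame with partly made-up values (complaint 25 is invented for illustration):

```r
library(dplyr)

# Toy data: complaint 18 has three allegation rows, complaint 25 has two.
toy <- data.frame(
  UniqueComplaintId = c(11, 18, 18, 18, 25, 25),
  Received.Year     = c(2005, 2004, 2004, 2004, 2005, 2005)
)

per_year <- toy %>%
  group_by(Received.Year) %>%
  summarise(
    complaints  = n_distinct(UniqueComplaintId), # unique complaint IDs
    allegations = n()                            # raw row count
  )

print(per_year)
```

In this toy example 2004 has one complaint but three allegation rows, which shows how a plain row count would overstate the number of complaints.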
To summarise, exploratory data analysis (EDA) helps us gain a rough understanding of the data so that we can identify relationships between the variables of interest, along with trends, patterns, problems, missing data, errors, and outliers. By looking at the structure of the data, summarizing it with descriptive statistics, and creating basic plots, the process lets us as investigators decide what is interesting in our data and what is not. The goal of EDA is not inference or presentation-ready plots. Rather, it is to show the data, obtain evidence, identify interesting patterns, and at the same time filter out variables that are not of interest.