Objectives

The objective of this assignment is to conduct an exploratory data analysis of a data set that you are not familiar with. In this week's lecture we discussed a number of visualization approaches to exploring a data set; this assignment will apply those tools and techniques. An important distinction between class examples and applied data science work is the iterative and repetitive nature of exploring a data set. It takes time to understand what is in the data and what is interesting in the data.

For this week we will be exploring data from the NYC Data Transparency Initiative. They maintain a database of complaints that fall within the jurisdiction of the Civilian Complaint Review Board (CCRB), an independent municipal agency. Your objective is to identify interesting patterns and trends within the data that may be indicative of large-scale trends.

This link will allow you to download the data set in .xlsx format. The data file has two tabs: one with metadata, and the “Complaints_Allegations” tab with the actual data.

Deliverable and Grades

For this assignment you should submit a link to a knitr-rendered HTML document that shows your exploratory data analysis. Organize your analysis using section headings:

# This is a top section

## This is a subsection

Setup & Preparation

Library packages

library(readr)
library(readxl)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(ggthemes)
library(stringr)

Read in data

getwd()
setwd("~/Desktop/HU/ANLY512/R")
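ccrb <- read.csv("ccrb.csv", stringsAsFactors = TRUE)  # read text columns as factors (no longer the default from R 4.0.0), as in the str() output below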
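
The assignment text describes the download as an .xlsx workbook whose data sits on the "Complaints_Allegations" tab, while the code above reads a CSV export. A minimal sketch of reading the workbook directly is shown below. The file name `ccrb.xlsx` is hypothetical, and the original column headers are assumed to contain spaces and an ampersand; `make.names()` converts such headers to the dotted form that `read.csv()` produces by default.

```r
# Hypothetical file name; substitute the path of the downloaded workbook.
xlsx_path <- "ccrb.xlsx"

if (file.exists(xlsx_path)) {
  # Read only the data tab named in the assignment text.
  ccrb_xlsx <- readxl::read_excel(xlsx_path, sheet = "Complaints_Allegations")
  # read_excel() keeps headers as written, so convert them to
  # syntactic names to match the dotted names used in this report.
  names(ccrb_xlsx) <- make.names(names(ccrb_xlsx))
}

# Illustration of the renaming on two assumed original headers.
clean_names <- make.names(c("Close Year",
                            "Complaint Contains Stop & Frisk Allegations"))
print(clean_names)
```

Note that `read_excel()` returns text columns as character rather than factor, so `summary()` on those columns would not give per-level counts without converting them first.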

Check data

str(ccrb)
## 'data.frame':    204397 obs. of  16 variables:
##  $ DateStamp                                  : Factor w/ 1 level "11/29/2016": 1 1 1 1 1 1 1 1 1 1 ...
##  $ UniqueComplaintId                          : int  11 18 18 18 18 18 18 18 18 18 ...
##  $ Close.Year                                 : int  2006 2006 2006 2006 2006 2006 2006 2006 2006 2006 ...
##  $ Received.Year                              : int  2005 2004 2004 2004 2004 2004 2004 2004 2004 2004 ...
##  $ Borough.of.Occurrence                      : Factor w/ 6 levels "Bronx","Brooklyn",..: 3 2 2 2 2 2 2 2 2 2 ...
##  $ Is.Full.Investigation                      : logi  FALSE TRUE TRUE TRUE TRUE TRUE ...
##  $ Complaint.Has.Video.Evidence               : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Complaint.Filed.Mode                       : Factor w/ 7 levels "Call Processing System",..: 6 7 7 7 7 7 7 7 7 7 ...
##  $ Complaint.Filed.Place                      : Factor w/ 14 levels "CCRB","Comm. to Combat Police Corruption",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Complaint.Contains.Stop...Frisk.Allegations: logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Incident.Location                          : Factor w/ 15 levels "Apartment/house",..: 14 14 14 14 14 14 14 14 14 14 ...
##  $ Incident.Year                              : int  2005 2004 2004 2004 2004 2004 2004 2004 2004 2004 ...
##  $ Encounter.Outcome                          : Factor w/ 4 levels "Arrest","No Arrest or Summons",..: 2 1 1 1 1 1 1 1 1 1 ...
##  $ Reason.For.Initial.Contact                 : Factor w/ 49 levels "Aided case","Arrest/Complainant",..: 23 32 32 32 32 32 32 32 32 32 ...
##  $ Allegation.FADO.Type                       : Factor w/ 4 levels "Abuse of Authority",..: 1 1 2 2 2 3 3 3 3 3 ...
##  $ Allegation.Description                     : Factor w/ 56 levels "Action","Animal",..: 48 35 56 56 56 27 27 27 27 27 ...
nrow(ccrb)
## [1] 204397
ncol(ccrb)
## [1] 16
head(ccrb, 5)     # Look at the top and bottom of data
##    DateStamp UniqueComplaintId Close.Year Received.Year
## 1 11/29/2016                11       2006          2005
## 2 11/29/2016                18       2006          2004
## 3 11/29/2016                18       2006          2004
## 4 11/29/2016                18       2006          2004
## 5 11/29/2016                18       2006          2004
##   Borough.of.Occurrence Is.Full.Investigation Complaint.Has.Video.Evidence
## 1             Manhattan                 FALSE                        FALSE
## 2              Brooklyn                  TRUE                        FALSE
## 3              Brooklyn                  TRUE                        FALSE
## 4              Brooklyn                  TRUE                        FALSE
## 5              Brooklyn                  TRUE                        FALSE
##   Complaint.Filed.Mode Complaint.Filed.Place
## 1      On-line website                  CCRB
## 2                Phone                  CCRB
## 3                Phone                  CCRB
## 4                Phone                  CCRB
## 5                Phone                  CCRB
##   Complaint.Contains.Stop...Frisk.Allegations Incident.Location
## 1                                       FALSE    Street/highway
## 2                                       FALSE    Street/highway
## 3                                       FALSE    Street/highway
## 4                                       FALSE    Street/highway
## 5                                       FALSE    Street/highway
##   Incident.Year    Encounter.Outcome
## 1          2005 No Arrest or Summons
## 2          2004               Arrest
## 3          2004               Arrest
## 4          2004               Arrest
## 5          2004               Arrest
##                     Reason.For.Initial.Contact Allegation.FADO.Type
## 1                                        Other   Abuse of Authority
## 2 PD suspected C/V of violation/crime - street   Abuse of Authority
## 3 PD suspected C/V of violation/crime - street          Discourtesy
## 4 PD suspected C/V of violation/crime - street          Discourtesy
## 5 PD suspected C/V of violation/crime - street          Discourtesy
##                Allegation.Description
## 1                    Threat of arrest
## 2 Refusal to obtain medical treatment
## 3                                Word
## 4                                Word
## 5                                Word
tail(ccrb, 5)
##         DateStamp UniqueComplaintId Close.Year Received.Year
## 204393 11/29/2016             69476       2016          2016
## 204394 11/29/2016             69476       2016          2016
## 204395 11/29/2016             69476       2016          2016
## 204396 11/29/2016             69476       2016          2016
## 204397 11/29/2016             69476       2016          2016
##        Borough.of.Occurrence Is.Full.Investigation
## 204393              Brooklyn                  TRUE
## 204394              Brooklyn                  TRUE
## 204395              Brooklyn                  TRUE
## 204396              Brooklyn                  TRUE
## 204397              Brooklyn                  TRUE
##        Complaint.Has.Video.Evidence Complaint.Filed.Mode
## 204393                        FALSE      On-line website
## 204394                        FALSE      On-line website
## 204395                        FALSE      On-line website
## 204396                        FALSE      On-line website
## 204397                        FALSE      On-line website
##        Complaint.Filed.Place Complaint.Contains.Stop...Frisk.Allegations
## 204393                  CCRB                                       FALSE
## 204394                  CCRB                                       FALSE
## 204395                  CCRB                                       FALSE
## 204396                  CCRB                                       FALSE
## 204397                  CCRB                                       FALSE
##        Incident.Location Incident.Year Encounter.Outcome
## 204393   Apartment/house          2016            Arrest
## 204394   Apartment/house          2016            Arrest
## 204395   Apartment/house          2016            Arrest
## 204396   Apartment/house          2016            Arrest
## 204397   Apartment/house          2016            Arrest
##         Reason.For.Initial.Contact Allegation.FADO.Type
## 204393 Execution of search warrant          Discourtesy
## 204394 Execution of search warrant          Discourtesy
## 204395 Execution of search warrant   Offensive Language
## 204396 Execution of search warrant   Offensive Language
## 204397 Execution of search warrant   Offensive Language
##        Allegation.Description
## 204393                   Word
## 204394                   Word
## 204395                 Gender
## 204396                 Gender
## 204397                 Gender
names(ccrb)
##  [1] "DateStamp"                                  
##  [2] "UniqueComplaintId"                          
##  [3] "Close.Year"                                 
##  [4] "Received.Year"                              
##  [5] "Borough.of.Occurrence"                      
##  [6] "Is.Full.Investigation"                      
##  [7] "Complaint.Has.Video.Evidence"               
##  [8] "Complaint.Filed.Mode"                       
##  [9] "Complaint.Filed.Place"                      
## [10] "Complaint.Contains.Stop...Frisk.Allegations"
## [11] "Incident.Location"                          
## [12] "Incident.Year"                              
## [13] "Encounter.Outcome"                          
## [14] "Reason.For.Initial.Contact"                 
## [15] "Allegation.FADO.Type"                       
## [16] "Allegation.Description"
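
The `head()` output shows complaint 18 on several rows, which suggests one row per allegation rather than per complaint. A quick check of missing values and repeated complaint IDs might look like the sketch below. It uses a small invented data frame (`toy`) as a stand-in for `ccrb`, so the values are illustrative only.

```r
# Toy stand-in for ccrb (values invented for illustration).
toy <- data.frame(
  UniqueComplaintId     = c(11, 18, 18, 18, 25),
  Borough.of.Occurrence = c("Manhattan", "Brooklyn", "Brooklyn", "Brooklyn", NA),
  Incident.Year         = c(2005, 2004, 2004, 2004, NA)
)

# Number of missing values in each column.
na_per_column <- colSums(is.na(toy))

# Number of rows (allegations) recorded for each complaint ID.
rows_per_complaint <- table(toy$UniqueComplaintId)

# Number of distinct complaints, as opposed to number of rows.
n_complaints <- length(unique(toy$UniqueComplaintId))

print(na_per_column)
print(rows_per_complaint)
print(n_complaints)
```

Running the same three checks on `ccrb` would show how far the row count (204397) overstates the number of complaints.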

Make plots

1. Boxplot: to see the distribution of the number of years complaints stayed active for each allegation type, based on the statistical summary

ccrb <- ccrb %>%
  mutate(num.yrs = Close.Year - Received.Year)    # create new variable: number of years taken for a complaint to close

ggplot(ccrb, aes(Allegation.FADO.Type, num.yrs))+ 
  geom_boxplot() +
  ggtitle("Boxplot showing statistical summary of 
allegation type vs number of years complaints stayed active")
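
The five-number summary that a boxplot draws can also be computed directly, which helps when many groups share the same median. The sketch below does this on an invented data frame (`toy`), not on `ccrb`. Because only calendar years are available, `num.yrs` is a coarse measure: a complaint received in December and closed the following January counts as one year.

```r
# Toy stand-in for ccrb (values invented for illustration).
toy <- data.frame(
  Allegation.FADO.Type = c("Force", "Force", "Force", "Discourtesy", "Discourtesy"),
  Received.Year        = c(2004, 2005, 2005, 2010, 2011),
  Close.Year           = c(2006, 2005, 2006, 2011, 2011)
)

# Same derived variable as in the report.
toy$num.yrs <- toy$Close.Year - toy$Received.Year

# Full numeric summary per allegation type.
summary_by_type <- tapply(toy$num.yrs, toy$Allegation.FADO.Type, summary)

# Median per allegation type (the thick line in each box).
median_by_type <- tapply(toy$num.yrs, toy$Allegation.FADO.Type, median)

print(summary_by_type)
print(median_by_type)
```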

2. Boxplot: to understand the distribution of incident years for each allegation type, based on the statistical summary

ggplot(ccrb, aes(y= Incident.Year, x= Allegation.FADO.Type)) +
  geom_boxplot() +
  ggtitle("Boxplot showing distribution of allegation types over the years")

3. Boxplot: to understand the distribution of incident years for each borough of occurrence, based on the statistical summary

ggplot(ccrb, aes(x= Borough.of.Occurrence, y= Incident.Year))+
  geom_boxplot() +
  labs(title='Boxplot showing distribution of borough of occurrence over the years')

4. Barplot: to understand the number of complaints that occurred in each borough

summary(ccrb$Borough.of.Occurrence) # Look at the table of counts per borough
##         Bronx      Brooklyn     Manhattan   Outside NYC        Queens 
##         49442         72215         42104           170         30883 
## Staten Island          NA's 
##          9100           483
ggplot(ccrb, aes(Borough.of.Occurrence)) +
  geom_bar(color= "white", fill= "tomato3") +
  ggtitle("Barplot showing numbers of complaints in each borough") +
  scale_x_discrete(labels = function(Borough.of.Occurrence) str_wrap(Borough.of.Occurrence, width = 10)) +
  scale_y_continuous(breaks = seq(0, 73000, 5000))
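
Since `geom_bar()` counts rows, and a single complaint can span several rows (one per allegation), the bar heights above are allegation counts. A sketch of counting distinct complaints per borough alongside the row counts, using the same `n_distinct()` approach as the line plot later in this report, is shown below on an invented data frame (`toy`).

```r
library(dplyr)

# Toy stand-in for ccrb (values invented for illustration).
toy <- data.frame(
  UniqueComplaintId     = c(11, 18, 18, 18, 25, 25),
  Borough.of.Occurrence = c("Manhattan", "Brooklyn", "Brooklyn",
                            "Brooklyn", "Bronx", "Bronx")
)

per_borough <- toy %>%
  group_by(Borough.of.Occurrence) %>%
  summarize(
    allegations = n(),                           # rows, as geom_bar() counts them
    complaints  = n_distinct(UniqueComplaintId)  # unique complaint IDs
  )

print(per_borough)
```

Plotting `complaints` with `geom_col()` instead of the raw rows would give a chart of complaints rather than allegations.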

5. Barplot: to understand how often each mode was used to file complaints

ggplot(ccrb, aes(x = Complaint.Filed.Mode)) + 
  geom_bar(stat = 'count', color= "white", fill= "tomato3") + 
  labs(title = 'Barplot showing numbers of each mode used for filing complaints') +
  scale_x_discrete(labels = function(Complaint.Filed.Mode) str_wrap(Complaint.Filed.Mode, width = 10))

6. Barplot: to understand the locations where the incidents occurred

ggplot(ccrb, aes(Incident.Location)) +
  geom_bar(color= "white", fill= "tomato3") +
  coord_flip() +
  ggtitle("Barplot showing numbers of incidents that occurred in different locations")

7. Stacked-bar plot: to understand the numbers of, and relationship between, full investigations and video evidence on complaints

ggplot(ccrb, aes(x = Is.Full.Investigation, fill = Complaint.Has.Video.Evidence)) + 
  geom_bar(stat = 'count') + 
  labs(title = 'Stacked-bar plot showing joint distribution of investigation 
and video evidence on the complaints') + 
  scale_fill_discrete(name = 'Complaint Has Video Evidence')

8. Stacked-bar plot: to understand the joint distribution of encounter outcome and investigation in relation to the complaint

ggplot(ccrb, aes(x = Encounter.Outcome, fill = Is.Full.Investigation)) + 
  geom_bar(stat = 'count') + 
  labs(title='Stacked-bar plot showing joint distribution of Encounter Outcome and Full Investigation') + 
  scale_fill_discrete(name = 'Full Investigation')
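
Stacked counts make it hard to compare groups of very different sizes. One option is to look at the share of full investigations within each encounter outcome. The sketch below computes row-wise proportions with `prop.table()` on an invented data frame (`toy`); in ggplot2, `geom_bar(position = "fill")` draws the same shares.

```r
# Toy stand-in for ccrb (values invented for illustration).
toy <- data.frame(
  Encounter.Outcome     = c("Arrest", "Arrest", "Arrest", "Arrest",
                            "Summons", "Summons"),
  Is.Full.Investigation = c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE)
)

# Cross-tabulated counts: rows are outcomes, columns are FALSE/TRUE.
counts <- table(toy$Encounter.Outcome, toy$Is.Full.Investigation)

# Proportions within each outcome (each row sums to 1).
shares <- prop.table(counts, margin = 1)

print(counts)
print(shares)
```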

9. Stacked-bar plot: to understand the distribution of allegation types in each borough in relation to the complaints

ggplot(ccrb, aes(x= Borough.of.Occurrence, fill= Allegation.FADO.Type)) + 
  geom_bar() +    # count rows per borough, stacked by allegation type
  labs (title = "Stacked-bar plot showing distribution of allegation type 
in each borough of occurrence") +
  scale_x_discrete(labels = function(Borough.of.Occurrence) str_wrap(Borough.of.Occurrence, width = 10))

10. Line plot: to understand the trend of filed complaints over the years

ccrb1 <- ccrb %>%
  group_by(Received.Year) %>%
  summarize(total = n_distinct(UniqueComplaintId)) %>%
  select(Received.Year, total)

ggplot(ccrb1, aes(x = Received.Year, y = total)) + 
  geom_line() + 
  ggtitle('Line plot showing trend of filed complaints over years')
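
Every row carries the DateStamp 11/29/2016. If that is the date the data was exported (an assumption, since the metadata tab is not shown here), then 2016 is a partial year and a drop at the right edge of the line may not reflect a real decline. The sketch below flags the final year on an invented data frame (`toy`) so it can be excluded or marked in the plot.

```r
library(dplyr)

# Toy stand-in for ccrb (values invented for illustration).
toy <- data.frame(
  UniqueComplaintId = c(1, 1, 2, 3, 4, 4, 5),
  Received.Year     = c(2014, 2014, 2014, 2015, 2015, 2015, 2016)
)

# Assumed from the DateStamp column: data exported partway through 2016.
snapshot_year <- 2016

per_year <- toy %>%
  group_by(Received.Year) %>%
  summarize(total = n_distinct(UniqueComplaintId)) %>%
  mutate(partial = Received.Year == snapshot_year)

print(per_year)
```

Filtering with `filter(!partial)` before plotting would restrict the trend line to complete years.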

Summary

To summarize, Exploratory Data Analysis (EDA) helps us gain a rough understanding of the data so that we can identify relationships between variables of interest, as well as trends, patterns, problems, missing data, errors, and outliers. By looking at the structure of the data, summarizing it statistically, and creating basic plots, we as investigators can decide what is interesting in our data and what is not. The goal of EDA is not inference or presentation-quality plots. Rather, it is to show the data, obtain evidence, identify interesting patterns, and at the same time filter out variables that are not of interest.