The objective of this assingment is to conduct an exploratory data analysis on some unfamiliar data. In this weeks lecture we discussed a number of visualiation approaches to exploring a data set, this assignment will apply those tools and techniques. An important distinction between class examples and applied data science work is that interative and repetitive nature of exploring a data set. It takes time and understand what is is the data and what is interesting in the data.
For this week we will be exploring data from the NYC Data Transparnecy Initiative. They maintain a database of complaints that fall within the Civilian Complain Review Board (CCRB), an independent municiple agency. Our objective is to identify interesting patterns and trends within the data that may be indicative of large scale trends.
We begin by loading the tidyverse and readxl libraries.
library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag(): dplyr, stats
library(readxl)
Then, we read the data in R using the read_excel command from the readxl package and view the structure of the data.
ccrb <- read_excel("C:/Harrisburg University Classes/Sem 2 - Fall 2017/ANLY-512/Assignments/Problem Set 4/ccrb_datatransparencyinitiative.xlsx",
sheet = "Complaints_Allegations")
str(ccrb)
## Classes 'tbl_df', 'tbl' and 'data.frame': 204397 obs. of 16 variables:
## $ DateStamp : POSIXct, format: "2016-11-29" "2016-11-29" ...
## $ UniqueComplaintId : num 11 18 18 18 18 18 18 18 18 18 ...
## $ Close Year : num 2006 2006 2006 2006 2006 ...
## $ Received Year : num 2005 2004 2004 2004 2004 ...
## $ Borough of Occurrence : chr "Manhattan" "Brooklyn" "Brooklyn" "Brooklyn" ...
## $ Is Full Investigation : logi FALSE TRUE TRUE TRUE TRUE TRUE ...
## $ Complaint Has Video Evidence : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ Complaint Filed Mode : chr "On-line website" "Phone" "Phone" "Phone" ...
## $ Complaint Filed Place : chr "CCRB" "CCRB" "CCRB" "CCRB" ...
## $ Complaint Contains Stop & Frisk Allegations: logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ Incident Location : chr "Street/highway" "Street/highway" "Street/highway" "Street/highway" ...
## $ Incident Year : num 2005 2004 2004 2004 2004 ...
## $ Encounter Outcome : chr "No Arrest or Summons" "Arrest" "Arrest" "Arrest" ...
## $ Reason For Initial Contact : chr "Other" "PD suspected C/V of violation/crime - street" "PD suspected C/V of violation/crime - street" "PD suspected C/V of violation/crime - street" ...
## $ Allegation FADO Type : chr "Abuse of Authority" "Abuse of Authority" "Discourtesy" "Discourtesy" ...
## $ Allegation Description : chr "Threat of arrest" "Refusal to obtain medical treatment" "Word" "Word" ...
Let’s take a look at the summarized data using the summary function.
summary(ccrb)
## DateStamp UniqueComplaintId Close Year Received Year
## Min. :2016-11-29 Min. : 1 Min. :2006 Min. :1999
## 1st Qu.:2016-11-29 1st Qu.:17356 1st Qu.:2008 1st Qu.:2007
## Median :2016-11-29 Median :34794 Median :2010 Median :2009
## Mean :2016-11-29 Mean :34778 Mean :2010 Mean :2010
## 3rd Qu.:2016-11-29 3rd Qu.:52204 3rd Qu.:2013 3rd Qu.:2012
## Max. :2016-11-29 Max. :69492 Max. :2016 Max. :2016
## Borough of Occurrence Is Full Investigation Complaint Has Video Evidence
## Length:204397 Mode :logical Mode :logical
## Class :character FALSE:107084 FALSE:195530
## Mode :character TRUE :97313 TRUE :8867
##
##
##
## Complaint Filed Mode Complaint Filed Place
## Length:204397 Length:204397
## Class :character Class :character
## Mode :character Mode :character
##
##
##
## Complaint Contains Stop & Frisk Allegations Incident Location
## Mode :logical Length:204397
## FALSE:119856 Class :character
## TRUE :84541 Mode :character
##
##
##
## Incident Year Encounter Outcome Reason For Initial Contact
## Min. :1999 Length:204397 Length:204397
## 1st Qu.:2007 Class :character Class :character
## Median :2009 Mode :character Mode :character
## Mean :2010
## 3rd Qu.:2012
## Max. :2016
## Allegation FADO Type Allegation Description
## Length:204397 Length:204397
## Class :character Class :character
## Mode :character Mode :character
##
##
##
From the summarized data we can draw a few conclusions. There are a total of 204,397 allegations for only 69,492 unique complaints. This suggests that the unique complaint is for a unique incident and there can be multiple allegations per unique complaint. There is temporal data for the complaints and corresponding incident in terms of Close Year, Received Year and Incident Year. There is spatial data about the incident like incident Location and Borough of Occurence. There is circumstantial evidence about the incident - Video Evidence, Stop and Frisk Allegations and the Encounter Outcome. Finally, there is a variety of information on the complaint - filing mode and place, reason for initial contact, allegation type and description.
Based on this information, we conduct our visual data exploration in the following sections.
We begin by manipulating the data so that we can identify the unique incidents and the number of complaints per unique incident.
# Identify and summarize unique incidents
ccrb_UniqueIncidents <- ccrb %>%
group_by(`Incident Year`, UniqueComplaintId) %>%
summarise(Complaints.count = n()) %>%
select(`Incident Year`, UniqueComplaintId, Complaints.count) %>%
arrange(`Incident Year`, UniqueComplaintId)
ggplot(ccrb_UniqueIncidents,
aes(x = `Incident Year`, y = ..count..)) +
geom_bar() +
labs(x = "Incident Year",
y = "Incident Counts")
Majority of the incidents for the complaints reported happened in 2006 to 2009 and thereafter, the number of incidents kept on decreasing.
received_year <- ccrb %>% group_by(`Received Year`) %>%
summarise(Complaints = n_distinct(UniqueComplaintId)) %>%
select(`Received Year`, Complaints)
ggplot(received_year, aes(`Received Year`, Complaints)) +
geom_line(col = 'orange') +
labs(x = "Year", y = "Complaints") +
scale_x_continuous(breaks = seq(1999, 2016, 2)) +
scale_y_continuous(breaks = seq(0, 8000, 500)) +
theme(panel.grid.major = element_line(colour = "white", size = 0.5)) +
geom_point()
The complaints received also follow the same pattern as the incidents reported: max between 2006 to 2009 and a slowly decreasing trend thereafter.
ggplot(ccrb, aes(x = ccrb$`Incident Year`, y = ccrb$`Received Year`)) +
geom_point() +
geom_smooth(method = lm) +
labs(x = "Incident Year",
y = "Received Year") +
scale_x_continuous(breaks = seq(1999, 2016, 2)) +
scale_y_continuous(breaks = seq(1999, 2016, 2)) +
theme(panel.grid.minor = element_line(colour = "white", size = 0.5),
legend.position = "bottom")
Most of the complaints are reported 2 years after the incident.
ggplot(ccrb, aes(x = ccrb$`Incident Year`, y = ccrb$`Close Year`)) +
geom_point() +
geom_smooth(method = lm) +
labs(x = "Incident Year",
y = "Close Year",
title = "Incident Year vs. Close Year") +
scale_x_continuous(breaks = seq(1999, 2016, 2)) +
scale_y_continuous(breaks = seq(1999, 2016, 2)) +
theme(panel.grid.minor = element_line(colour = "white", size = 0.5),
legend.position = "bottom")
The difference between the Incident Year and the Close Year for most incidents are about 4 years.
ggplot(ccrb, aes(x = ccrb$`Received Year`, y = ccrb$`Close Year`,
col = ccrb$`Is Full Investigation`)) +
geom_point() +
geom_smooth(method = lm) +
labs(x = "Received Year",
y = "Close Year") +
scale_x_continuous(breaks = seq(1999, 2016, 2)) +
scale_y_continuous(breaks = seq(1999, 2016, 2)) +
guides(col = guide_legend(title = "Is Full Investigation?")) +
theme(panel.grid.minor = element_line(colour = "white", size = 0.5),
legend.position = "bottom")
The difference between the Close Year and Received Year for Full Investigations is higher than those of the non-full investigations.
Here, we take a look at how the incidents and complaints reported are geographically distributed.
ggplot(ccrb, aes(x = ccrb$`Borough of Occurrence`, ..count..)) +
geom_bar() +
labs(x = "Borough of Occurence",
y = "Incident Counts")
Highest incidents occur in Brooklyn followed by Bronx and Manhattan respectively.
ggplot(ccrb, aes(x = ccrb$`Complaint Filed Mode`, ..count..,
fill = ccrb$`Borough of Occurrence`)) +
geom_bar() +
labs(x = "Complaint Filed Mode",
y = "Incident Counts") +
theme(legend.position = "bottom") +
scale_fill_discrete(name = "Borough of Occurence")
Most complaints are received through Phone followed by the Call Processing System.
ggplot(ccrb, aes(x = ccrb$`Incident Location`,
fill = ccrb$`Borough of Occurrence`)) +
geom_bar(stat = "count") +
coord_flip() +
labs(x = "Incident Location",
y = "Incident Counts") +
scale_x_discrete(name = "Incidents Location") +
guides(fill = guide_legend(title = "Borough of Occurence")) +
theme(panel.grid.minor = element_line(colour = "white", size = 0.5),
legend.position = "bottom")
Most incidents happen in Street/highway followed by incidents happening in Apartment/house.
ggplot(ccrb, aes(x = ccrb$`Encounter Outcome`, ..count..,
fill = ccrb$`Borough of Occurrence`)) +
geom_bar() +
labs(x = "Encounter Outcome",
y = "Incident Counts") +
theme(legend.position = "bottom") +
scale_fill_discrete(name = "Borough of Occurrence")
The possibility of an incident leading to Arrest or Summons is more than 50 %.
Here, we take a look at the descriptives of the Complaints Received by Allegation Type, Video Evidence and Encounter Outcome.
ccrb %>% filter(`Received Year` > 2003) %>%
ggplot(aes(x = `Received Year`, fill = `Complaint Filed Place`)) +
geom_bar() +
labs(x = "Year", y = "Complaints") +
theme(legend.position = "right") +
guides(fill = guide_legend(title = "Complaint Place"))
Most complaints are received at CCRB fllowed by IAM.
ccrb %>% filter(`Received Year` > 2003) %>%
ggplot(aes(x = `Received Year`, fill = `Allegation FADO Type`)) +
geom_bar() +
labs(x = "Year", y = "Complaints") +
theme(legend.position = "bottom") +
scale_fill_discrete(name = "Allegation Fado Type") +
scale_x_continuous(breaks = seq(1999, 2016, 2))
The most frequent allegation type is Abuse of Authority followed by Force.
ggplot(ccrb, aes(x = ccrb$`Encounter Outcome`, ..count..,
fill = factor(ccrb$`Complaint Has Video Evidence`))) +
geom_bar() +
labs(x = "Encounter Outcome",
y = "Incident Counts") +
theme(legend.position = "bottom") +
scale_fill_discrete(name = "Has Video Evidence?")
Very few encounters have Video Evidence.
Exploratory Data Analysis (EDA) is the iterative process by which the data analyst gains a quantitative and qualitative understanding of a data set through asking and answering questions. Visual Data Exploration (VDE) is a part of EDA that explicitly refers to the high use of visual perception in the analytical cycle. In this exercise, we conducted the initial VDE on the New York’s Civilian Complaint Review Board (CCRB) and obtained a number of insights on the collected data as noted above.