ANLY-512 Problem Set 4

Objectives

The objective of this assingment is to conduct an exploratory data analysis on some unfamiliar data. In this weeks lecture we discussed a number of visualiation approaches to exploring a data set, this assignment will apply those tools and techniques. An important distinction between class examples and applied data science work is that interative and repetitive nature of exploring a data set. It takes time and understand what is is the data and what is interesting in the data.

For this week we will be exploring data from the NYC Data Transparnecy Initiative. They maintain a database of complaints that fall within the Civilian Complain Review Board (CCRB), an independent municiple agency. Our objective is to identify interesting patterns and trends within the data that may be indicative of large scale trends.

Load the data and review data structure

We begin by loading the tidyverse and readxl libraries.

library(tidyverse)

## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr

## Conflicts with tidy packages ----------------------------------------------

## filter(): dplyr, stats
## lag():    dplyr, stats

library(readxl)

Then, we read the data in R using the read_excel command from the readxl package and view the structure of the data.

ccrb <- read_excel("C:/Harrisburg University Classes/Sem 2 - Fall 2017/ANLY-512/Assignments/Problem Set 4/ccrb_datatransparencyinitiative.xlsx", 
    sheet = "Complaints_Allegations")
str(ccrb)

## Classes 'tbl_df', 'tbl' and 'data.frame':    204397 obs. of  16 variables:
##  $ DateStamp                                  : POSIXct, format: "2016-11-29" "2016-11-29" ...
##  $ UniqueComplaintId                          : num  11 18 18 18 18 18 18 18 18 18 ...
##  $ Close Year                                 : num  2006 2006 2006 2006 2006 ...
##  $ Received Year                              : num  2005 2004 2004 2004 2004 ...
##  $ Borough of Occurrence                      : chr  "Manhattan" "Brooklyn" "Brooklyn" "Brooklyn" ...
##  $ Is Full Investigation                      : logi  FALSE TRUE TRUE TRUE TRUE TRUE ...
##  $ Complaint Has Video Evidence               : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Complaint Filed Mode                       : chr  "On-line website" "Phone" "Phone" "Phone" ...
##  $ Complaint Filed Place                      : chr  "CCRB" "CCRB" "CCRB" "CCRB" ...
##  $ Complaint Contains Stop & Frisk Allegations: logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Incident Location                          : chr  "Street/highway" "Street/highway" "Street/highway" "Street/highway" ...
##  $ Incident Year                              : num  2005 2004 2004 2004 2004 ...
##  $ Encounter Outcome                          : chr  "No Arrest or Summons" "Arrest" "Arrest" "Arrest" ...
##  $ Reason For Initial Contact                 : chr  "Other" "PD suspected C/V of violation/crime - street" "PD suspected C/V of violation/crime - street" "PD suspected C/V of violation/crime - street" ...
##  $ Allegation FADO Type                       : chr  "Abuse of Authority" "Abuse of Authority" "Discourtesy" "Discourtesy" ...
##  $ Allegation Description                     : chr  "Threat of arrest" "Refusal to obtain medical treatment" "Word" "Word" ...

Let’s take a look at the summarized data using the summary function.

summary(ccrb)

##    DateStamp          UniqueComplaintId   Close Year   Received Year 
##  Min.   :2016-11-29   Min.   :    1     Min.   :2006   Min.   :1999  
##  1st Qu.:2016-11-29   1st Qu.:17356     1st Qu.:2008   1st Qu.:2007  
##  Median :2016-11-29   Median :34794     Median :2010   Median :2009  
##  Mean   :2016-11-29   Mean   :34778     Mean   :2010   Mean   :2010  
##  3rd Qu.:2016-11-29   3rd Qu.:52204     3rd Qu.:2013   3rd Qu.:2012  
##  Max.   :2016-11-29   Max.   :69492     Max.   :2016   Max.   :2016  
##  Borough of Occurrence Is Full Investigation Complaint Has Video Evidence
##  Length:204397         Mode :logical         Mode :logical               
##  Class :character      FALSE:107084          FALSE:195530                
##  Mode  :character      TRUE :97313           TRUE :8867                  
##                                                                          
##                                                                          
##                                                                          
##  Complaint Filed Mode Complaint Filed Place
##  Length:204397        Length:204397        
##  Class :character     Class :character     
##  Mode  :character     Mode  :character     
##                                            
##                                            
##                                            
##  Complaint Contains Stop & Frisk Allegations Incident Location 
##  Mode :logical                               Length:204397     
##  FALSE:119856                                Class :character  
##  TRUE :84541                                 Mode  :character  
##                                                                
##                                                                
##                                                                
##  Incident Year  Encounter Outcome  Reason For Initial Contact
##  Min.   :1999   Length:204397      Length:204397             
##  1st Qu.:2007   Class :character   Class :character          
##  Median :2009   Mode  :character   Mode  :character          
##  Mean   :2010                                                
##  3rd Qu.:2012                                                
##  Max.   :2016                                                
##  Allegation FADO Type Allegation Description
##  Length:204397        Length:204397         
##  Class :character     Class :character      
##  Mode  :character     Mode  :character      
##                                             
##                                             
##

From the summarized data we can draw a few conclusions. There are a total of 204,397 allegations for only 69,492 unique complaints. This suggests that the unique complaint is for a unique incident and there can be multiple allegations per unique complaint. There is temporal data for the complaints and corresponding incident in terms of Close Year, Received Year and Incident Year. There is spatial data about the incident like incident Location and Borough of Occurence. There is circumstantial evidence about the incident - Video Evidence, Stop and Frisk Allegations and the Encounter Outcome. Finally, there is a variety of information on the complaint - filing mode and place, reason for initial contact, allegation type and description.
Based on this information, we conduct our visual data exploration in the following sections.

Temporal distribution of Complaints and Incident

We begin by manipulating the data so that we can identify the unique incidents and the number of complaints per unique incident.

# Identify and summarize unique incidents
  ccrb_UniqueIncidents <- ccrb %>%
  group_by(`Incident Year`, UniqueComplaintId) %>%
  summarise(Complaints.count = n()) %>% 
  select(`Incident Year`, UniqueComplaintId, Complaints.count) %>%
  arrange(`Incident Year`, UniqueComplaintId)

Distribution of Incidents by Incident Year

ggplot(ccrb_UniqueIncidents, 
       aes(x = `Incident Year`, y = ..count..)) +
         geom_bar() +
  labs(x = "Incident Year",
       y = "Incident Counts")

Majority of the incidents for the complaints reported happened in 2006 to 2009 and thereafter, the number of incidents kept on decreasing.

Complaints Received per Year

received_year <- ccrb %>% group_by(`Received Year`) %>%
  summarise(Complaints = n_distinct(UniqueComplaintId)) %>%
  select(`Received Year`, Complaints)

ggplot(received_year, aes(`Received Year`, Complaints)) +
  geom_line(col = 'orange') +
  labs(x = "Year", y = "Complaints") +
  scale_x_continuous(breaks = seq(1999, 2016, 2)) +
  scale_y_continuous(breaks = seq(0, 8000, 500)) +
  theme(panel.grid.major = element_line(colour = "white", size = 0.5)) +
  geom_point()

The complaints received also follow the same pattern as the incidents reported: max between 2006 to 2009 and a slowly decreasing trend thereafter.

Incident Year vs. Received Year

ggplot(ccrb, aes(x = ccrb$`Incident Year`, y = ccrb$`Received Year`)) +
         geom_point() +
  geom_smooth(method = lm) +
  labs(x = "Incident Year",
       y = "Received Year") +
  scale_x_continuous(breaks = seq(1999, 2016, 2)) +
  scale_y_continuous(breaks = seq(1999, 2016, 2)) +
  theme(panel.grid.minor = element_line(colour = "white", size = 0.5),
        legend.position = "bottom")

Most of the complaints are reported 2 years after the incident.

Incident Year vs Close Year

ggplot(ccrb, aes(x = ccrb$`Incident Year`, y = ccrb$`Close Year`)) +
         geom_point() +
  geom_smooth(method = lm) +
  labs(x = "Incident Year",
       y = "Close Year",
       title = "Incident Year vs. Close Year") +
  scale_x_continuous(breaks = seq(1999, 2016, 2)) +
  scale_y_continuous(breaks = seq(1999, 2016, 2)) +
  theme(panel.grid.minor = element_line(colour = "white", size = 0.5),
        legend.position = "bottom")

The difference between the Incident Year and the Close Year for most incidents are about 4 years.

Close Year vs. Received Year

ggplot(ccrb, aes(x = ccrb$`Received Year`, y = ccrb$`Close Year`,
                  col = ccrb$`Is Full Investigation`)) +
         geom_point() +
  geom_smooth(method = lm) +
  labs(x = "Received Year",
       y = "Close Year") +
  scale_x_continuous(breaks = seq(1999, 2016, 2)) +
  scale_y_continuous(breaks = seq(1999, 2016, 2)) +
  guides(col = guide_legend(title = "Is Full Investigation?")) +
  theme(panel.grid.minor = element_line(colour = "white", size = 0.5),
        legend.position = "bottom")

The difference between the Close Year and Received Year for Full Investigations is higher than those of the non-full investigations.

Geographical Distribution of Complaints and Incidents

Here, we take a look at how the incidents and complaints reported are geographically distributed.

Distribution of Incidents by Borough of Occurence

ggplot(ccrb, aes(x = ccrb$`Borough of Occurrence`, ..count..)) +
  geom_bar() +
  labs(x = "Borough of Occurence",
       y = "Incident Counts")

Highest incidents occur in Brooklyn followed by Bronx and Manhattan respectively.

Distribution of Incidents by Complaint Filed Mode and Borough of Occurence

ggplot(ccrb, aes(x = ccrb$`Complaint Filed Mode`, ..count.., 
                 fill = ccrb$`Borough of Occurrence`)) +
  geom_bar() +
  labs(x = "Complaint Filed Mode",
       y = "Incident Counts") +
  theme(legend.position = "bottom") +
  scale_fill_discrete(name = "Borough of Occurence")

Most complaints are received through Phone followed by the Call Processing System.

Distribution by Incident Location and Borough of Occurence

ggplot(ccrb, aes(x = ccrb$`Incident Location`,
                 fill = ccrb$`Borough of Occurrence`)) +
  geom_bar(stat = "count") +
  coord_flip() +
  labs(x = "Incident Location",
       y = "Incident Counts") +
  scale_x_discrete(name = "Incidents Location") +
  guides(fill = guide_legend(title = "Borough of Occurence")) +
  theme(panel.grid.minor = element_line(colour = "white", size = 0.5),
        legend.position = "bottom")

Most incidents happen in Street/highway followed by incidents happening in Apartment/house.

Distribution by Borough of Occurence and Encounter Outcome

ggplot(ccrb, aes(x = ccrb$`Encounter Outcome`, ..count.., 
                 fill = ccrb$`Borough of Occurrence`)) +
         geom_bar() +
  labs(x = "Encounter Outcome",
       y = "Incident Counts") +
  theme(legend.position = "bottom") +
  scale_fill_discrete(name = "Borough of Occurrence")

The possibility of an incident leading to Arrest or Summons is more than 50 %.

Descriptives on Complaints Received

Here, we take a look at the descriptives of the Complaints Received by Allegation Type, Video Evidence and Encounter Outcome.

Complaints Received by Place Over Time

ccrb %>% filter(`Received Year` > 2003) %>%
  ggplot(aes(x = `Received Year`, fill = `Complaint Filed Place`)) +
  geom_bar() +
  labs(x = "Year", y = "Complaints") +
  theme(legend.position = "right") +
  guides(fill = guide_legend(title = "Complaint Place"))

Most complaints are received at CCRB fllowed by IAM.

Complaints Received by Allegation Type over Time

ccrb %>% filter(`Received Year` > 2003) %>%
  ggplot(aes(x = `Received Year`, fill = `Allegation FADO Type`)) +
  geom_bar() +
  labs(x = "Year", y = "Complaints") +
  theme(legend.position = "bottom") +
  scale_fill_discrete(name = "Allegation Fado Type") +
  scale_x_continuous(breaks = seq(1999, 2016, 2))

The most frequent allegation type is Abuse of Authority followed by Force.

Distribution by Encounter Outcome and Video Evidence

ggplot(ccrb, aes(x = ccrb$`Encounter Outcome`, ..count.., 
                 fill = factor(ccrb$`Complaint Has Video Evidence`))) +
         geom_bar() +
  labs(x = "Encounter Outcome",
       y = "Incident Counts") +
  theme(legend.position = "bottom") +
  scale_fill_discrete(name = "Has Video Evidence?")

Very few encounters have Video Evidence.

Summary

Exploratory Data Analysis (EDA) is the iterative process by which the data analyst gains a quantitative and qualitative understanding of a data set through asking and answering questions. Visual Data Exploration (VDE) is a part of EDA that explicitly refers to the high use of visual perception in the analytical cycle. In this exercise, we conducted the initial VDE on the New York’s Civilian Complaint Review Board (CCRB) and obtained a number of insights on the collected data as noted above.