Objectives

The objective of this assignment is to conduct an exploratory data analysis of a data set that you are not familiar with. In this week's lecture we discussed a number of visualization approaches to exploring a data set; this assignment applies those tools and techniques. An important distinction between class examples and applied data science work is the iterative and repetitive nature of exploring a data set. It takes time to understand what is in the data and what is interesting in it.

For this week we will be exploring data from the NYC Data Transparency Initiative. They maintain a database of complaints that fall within the jurisdiction of the Civilian Complaint Review Board (CCRB), an independent municipal agency. Your objective is to identify interesting patterns and trends within the data that may be indicative of large-scale trends.

library(readxl)
library(ggplot2)
library(ggthemes)

ccrb <- read_excel("Data/ccrb_datatransparencyinitiative.xlsx", sheet = "Complaints_Allegations")

summary(ccrb)

##    DateStamp          UniqueComplaintId   Close Year   Received Year 
##  Min.   :2016-11-29   Min.   :    1     Min.   :2006   Min.   :1999  
##  1st Qu.:2016-11-29   1st Qu.:17356     1st Qu.:2008   1st Qu.:2007  
##  Median :2016-11-29   Median :34794     Median :2010   Median :2009  
##  Mean   :2016-11-29   Mean   :34778     Mean   :2010   Mean   :2010  
##  3rd Qu.:2016-11-29   3rd Qu.:52204     3rd Qu.:2013   3rd Qu.:2012  
##  Max.   :2016-11-29   Max.   :69492     Max.   :2016   Max.   :2016  
##  Borough of Occurrence Is Full Investigation Complaint Has Video Evidence
##  Length:204397         Mode :logical         Mode :logical               
##  Class :character      FALSE:107084          FALSE:195530                
##  Mode  :character      TRUE :97313           TRUE :8867                  
##                                                                          
##                                                                          
##                                                                          
##  Complaint Filed Mode Complaint Filed Place
##  Length:204397        Length:204397        
##  Class :character     Class :character     
##  Mode  :character     Mode  :character     
##                                            
##                                            
##                                            
##  Complaint Contains Stop & Frisk Allegations Incident Location 
##  Mode :logical                               Length:204397     
##  FALSE:119856                                Class :character  
##  TRUE :84541                                 Mode  :character  
##                                                                
##                                                                
##                                                                
##  Incident Year  Encounter Outcome  Reason For Initial Contact
##  Min.   :1999   Length:204397      Length:204397             
##  1st Qu.:2007   Class :character   Class :character          
##  Median :2009   Mode  :character   Mode  :character          
##  Mean   :2010                                                
##  3rd Qu.:2012                                                
##  Max.   :2016                                                
##  Allegation FADO Type Allegation Description
##  Length:204397        Length:204397         
##  Class :character     Class :character      
##  Mode  :character     Mode  :character      
##                                             
##                                             
##
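
The summary above reports 204,397 rows but a maximum UniqueComplaintId of 69,492, and the sheet is named Complaints_Allegations. This suggests (an assumption, not something the source confirms) that each row is one allegation and that a complaint can span several rows. If so, the bar charts that follow count allegations rather than complaints. A minimal sketch of collapsing to one row per complaint, using a made-up toy data frame in place of `ccrb`:

```r
# Toy stand-in for `ccrb` (made-up values): one row per allegation.
toy <- data.frame(
  UniqueComplaintId = c(1, 1, 1, 2, 3, 3),
  `Borough of Occurrence` = c("Bronx", "Bronx", "Bronx",
                              "Queens", "Brooklyn", "Brooklyn"),
  check.names = FALSE, stringsAsFactors = FALSE
)

# Keep only the first row seen for each complaint ID.
complaints <- toy[!duplicated(toy$UniqueComplaintId), ]

n_allegations <- nrow(toy)         # rows in the allegation-level table
n_complaints  <- nrow(complaints)  # distinct complaints
print(c(allegations = n_allegations, complaints = n_complaints))
```

Running the same two steps on `ccrb` would show whether the row count and the number of distinct IDs really differ, and plotting from the collapsed table would answer complaint-level questions.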

Visualization 1: How were the complaints filed over time?

# Share of each filing mode per incident year (bars stacked to 100%).
# Columns are referenced by bare name inside aes(), not via ccrb$.
ccrb_vis1 <- ggplot(data = ccrb, 
                    aes(x = `Incident Year`, fill = `Complaint Filed Mode`)) +
  geom_bar(position = "fill") +
  scale_fill_discrete(name = "Complaint Filed Mode") + 
  labs(x = "Incident Year", y = "Proportion of Incidents", title = "Complaint filing modes, 1999-2016") + 
  xlim(1999, 2016)
           
ccrb_vis1

– Methods of filing complaints have changed over time. Fax was the main channel in the earliest years, while use of the on-line website has been increasing. Overall, however, the majority of complaints were filed by phone and through the call processing system combined.
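
The shares behind a `position = "fill"` bar chart can also be read as numbers, which makes claims such as "the majority" easier to check. A minimal sketch with base R's `table()` and `prop.table()`, using made-up rows rather than the real `ccrb` data:

```r
# Toy stand-in for `ccrb` (made-up values).
toy <- data.frame(
  `Incident Year` = c(2005, 2005, 2005, 2005, 2015, 2015, 2015, 2015),
  `Complaint Filed Mode` = c("Phone", "Phone", "Phone", "Fax",
                             "Phone", "Phone",
                             "On-line website", "On-line website"),
  check.names = FALSE, stringsAsFactors = FALSE
)

# Rows = incident year, columns = filing mode.
counts <- table(toy$`Incident Year`, toy$`Complaint Filed Mode`)

# margin = 1 divides each row by its total, so every year sums to 1.
shares <- prop.table(counts, margin = 1)
print(round(shares, 2))
```

Each row of `shares` corresponds to one stacked bar in the chart.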

Visualization 2: How many incidents occurred each year, and where?

# Number of incidents per year, stacked by borough.
ccrb_vis2 <- ggplot(data = ccrb, 
                    aes(x = `Incident Year`, fill = `Borough of Occurrence`)) +
  geom_bar() + 
  scale_fill_discrete(name = "Borough of Occurrence") + 
  labs(x = "Incident Year", y = "Number of Incidents", title = "Number of incidents in each borough, 1999-2016") + 
  xlim(1999, 2016)
           
ccrb_vis2

– Since 2007, the overall number of incidents has been decreasing across the boroughs. From this graph alone, however, it is hard to tell whether the Bronx or Brooklyn had the most incidents.

Visualization 3: Which borough had the most incidents?

# Total number of incidents per borough across all years.
ccrb_vis3 <- ggplot(data = ccrb, 
                    aes(x = `Borough of Occurrence`)) +
  geom_bar() +
  labs(x = "Borough of Occurrence", y = "Number of Incidents", title = "Incidents by Borough") 

ccrb_vis3

– Now we can see that Brooklyn had the most incidents, followed by the Bronx and Manhattan.
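
A ranking like this can be confirmed without reading bar heights by sorting the counts. The sketch below uses a made-up vector of borough labels in place of the real column; the counts it prints are illustrative only.

```r
# Made-up borough labels standing in for ccrb$`Borough of Occurrence`.
boroughs <- c("Brooklyn", "Bronx", "Brooklyn", "Manhattan", "Brooklyn", "Bronx")

# Count rows per borough and sort from most to fewest.
counts <- sort(table(boroughs), decreasing = TRUE)
top_borough <- names(counts)[1]

# Factor with levels in ranked order; a bar chart of this factor
# draws its bars in the same order.
ranked <- factor(boroughs, levels = names(counts))
print(counts)
```

Reordering the factor levels before plotting is one way to make the largest borough stand out at a glance.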

Visualization 4: Where did the incident take place?

# Share of each incident location within every borough (bars stacked to 100%).
ccrb_vis4 <- ggplot(data = ccrb, 
                    aes(x = `Borough of Occurrence`, fill = `Incident Location`)) +
  geom_bar(position = "fill") +
  scale_fill_discrete(name = "Incident Location") + 
  labs(x = "Borough of Occurrence", y = "Proportion of Incidents", title = "Proportion of Incident Locations in Each Borough") 

ccrb_vis4

– The majority of the alleged incidents occurred on a street/highway in every borough. Manhattan had a higher share of incidents that took place in a subway station/train than the other boroughs. We can also see that a larger share of the incidents outside NYC happened in an apartment/house.

Visualization 5: How promptly were incidents reported?

# Year the complaint was received against the year the incident occurred.
ccrb_vis5 <- ggplot(data = ccrb, 
                    aes(x = `Incident Year`, y = `Received Year`)) +
  geom_point() +
  geom_smooth() +
  labs(x = "Incident Year", y = "Received Year", title = "Relationship between Incident Year and Received Year")

ccrb_vis5

## `geom_smooth()` using method = 'gam'

– Incident Year and Received Year appear to follow a linear relationship along the diagonal, meaning that most incidents were reported in the same year they occurred.
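
Because both axes are whole years, many rows land on the same point in the scatter plot, so the plot cannot show how many complaints sit on the diagonal versus off it. One way to check the "same year" reading is to tabulate the lag directly. This is a sketch on made-up rows, not output from the real data:

```r
# Toy stand-in for `ccrb` (made-up values).
toy <- data.frame(
  `Incident Year` = c(2004, 2004, 2005, 2005, 2006, 2006),
  `Received Year` = c(2004, 2005, 2005, 2005, 2006, 2007),
  check.names = FALSE
)

# Reporting lag in whole years; 0 means reported in the year of the incident.
lag_years <- toy$`Received Year` - toy$`Incident Year`

lag_table <- table(lag_years)            # how many rows at each lag
same_year_share <- mean(lag_years == 0)  # fraction reported in the same year
print(lag_table)
print(same_year_share)
```

A `same_year_share` close to 1 on the real data would support the interpretation above.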

Visualization 6: How long did it take from receiving to closing the case?

# Year the case was closed against the year the complaint was received.
ccrb_vis6 <- ggplot(data = ccrb, 
                    aes(x = `Received Year`, y = `Close Year`)) +
  geom_point() +
  geom_smooth() +
  labs(x = "Received Year", y = "Close Year", title = "Relationship between Received Year and Close Year")

ccrb_vis6

## `geom_smooth()` using method = 'gam'

– Received Year and Close Year, however, do not seem to show a clear linear relationship until after 2005. Some incidents reported before 2005 were not closed within the same year. The situation has improved since 2005, from which point the linear trend forms the baseline. Even so, quite a number of cases took more than a year to close, suggesting that they required a longer investigation to reach a conclusion.
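
The summary at the top shows Close Year starting at 2006 while Received Year starts at 1999, so any complaint received before 2006 necessarily appears with a lag of at least a year. That may reflect how the extract was built rather than how cases were handled; this is an inference from the summary, not something the source states. Averaging the time to close by received year makes the pattern explicit. A sketch on made-up rows:

```r
# Toy stand-in for `ccrb` (made-up values).
toy <- data.frame(
  `Received Year` = c(2003, 2003, 2004, 2010, 2010, 2010),
  `Close Year`    = c(2006, 2007, 2006, 2010, 2011, 2010),
  check.names = FALSE
)

# Whole years between receiving a complaint and closing it.
years_to_close <- toy$`Close Year` - toy$`Received Year`

# Average time to close for each received year.
mean_by_year <- tapply(years_to_close, toy$`Received Year`, mean)
print(round(mean_by_year, 2))
```

Run on the real data, a sharp drop in the average around 2006 would point to the data cutoff rather than to a change in processing speed.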

Visualization 7: What was the encounter outcome trend in each Borough?

# Share of each encounter outcome within every borough (bars stacked to 100%).
ccrb_vis7 <- ggplot(data = ccrb, 
                    aes(x = `Borough of Occurrence`, fill = `Encounter Outcome`)) +
  geom_bar(position = "fill") +
  scale_fill_discrete(name = "Encounter Outcome") + 
  labs(x = "Borough of Occurrence", y = "Proportion of Incidents", title = "Proportion of Encounter Outcome in Each Borough") 

ccrb_vis7

– In the Bronx, Brooklyn and Staten Island, encounters were more likely to end in an arrest than in the other boroughs. Outside NYC, by contrast, no arrest or summons made up the majority of encounter outcomes.

Visualization 8: Did full investigation have any impact on the encounter outcomes?

# Number of incidents per encounter outcome, split by full investigation (TRUE/FALSE).
ccrb_vis8 <- ggplot(data = ccrb, 
                    aes(x = `Encounter Outcome`, fill = `Is Full Investigation`)) +
  geom_bar() +
  scale_fill_discrete(name = "Full Investigation [T/F]") + 
  labs(x = "Encounter Outcome", y = "Number of Incidents", title = "Encounter Outcome and Presence of Full Investigation") 

ccrb_vis8

– Arrest was the most common encounter outcome, followed by no arrest or summons. For the more serious outcomes, arrest or summons, where charges need to be clearly stated, approximately 50% or more of the incidents involved a full investigation. For incidents with no arrest or summons, full investigations were less common.

Visualization 9: Did video evidence have any impact on the encounter outcomes?

# Number of incidents per encounter outcome, split by video evidence (TRUE/FALSE).
ccrb_vis9 <- ggplot(data = ccrb, 
                    aes(x = `Encounter Outcome`, fill = `Complaint Has Video Evidence`)) +
  geom_bar() +
  scale_fill_discrete(name = "Video Evidence [T/F]") + 
  labs(x = "Encounter Outcome", y = "Number of Incidents", title = "Encounter Outcome and Presence of Video Evidence") 

ccrb_vis9

– Most cases, regardless of the encounter outcome, did not have video evidence. The summary at the top shows only 8,867 of the 204,397 rows (about 4%) flagged as having video.
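
With a stacked count chart it is easy to misread which color is TRUE and which is FALSE, so it helps to compute the share directly. Taking the mean of a logical column gives the fraction of TRUE values. A sketch on made-up rows (the outcome labels and values are illustrative, not taken from the real data):

```r
# Toy stand-in for `ccrb` (made-up values).
toy <- data.frame(
  `Encounter Outcome` = c("Arrest", "Arrest", "Arrest", "Arrest",
                          "Summons", "Summons",
                          "No Arrest or Summons", "No Arrest or Summons"),
  `Complaint Has Video Evidence` = c(TRUE, FALSE, FALSE, FALSE,
                                     FALSE, FALSE, TRUE, FALSE),
  check.names = FALSE, stringsAsFactors = FALSE
)

# Fraction of rows with video evidence, overall and per encounter outcome.
overall_share <- mean(toy$`Complaint Has Video Evidence`)
video_share <- tapply(toy$`Complaint Has Video Evidence`,
                      toy$`Encounter Outcome`, mean)
print(overall_share)
print(video_share)
```

For the real data, the overall figure follows from the summary: 8,867 / 204,397 is roughly 0.043.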

Visualization 10: How did the allegation FADO type vary by encounter outcome?

# Share of each FADO allegation type within every encounter outcome.
# Stored as ccrb_vis10 so that it does not overwrite Visualization 9.
ccrb_vis10 <- ggplot(data = ccrb, 
                     aes(x = `Encounter Outcome`, fill = `Allegation FADO Type`)) +
  geom_bar(position = "fill") +
  scale_fill_discrete(name = "Allegation FADO Type") + 
  labs(x = "Encounter Outcome", y = "Proportion of Incidents", title = "Encounter Outcome and Allegation FADO Type") 

ccrb_vis10

– For incidents that ended in an arrest, force made up the majority of allegation FADO types. For cases with no arrest or summons, or with only a summons, abuse of authority was the main type.

Conclusion

– Performing Exploratory Data Analysis (EDA) on the CCRB data let us analyze several different aspects of the complaint allegations. We could see the trend in incidents over time, the boroughs and locations where the incidents occurred, and how often video evidence and full investigations were present for each type of encounter outcome. All of these could lead to more specific questions that are worth the time and effort of further analysis.

– EDA is a useful approach for data that we are not familiar with. It allows us to explore what is in the data and lets the data tell its story in response to the questions we ask. Through graphical visualization, we can understand the data, find answers to our questions, or raise additional questions for further analysis.

ANLY 512-53 Problem Set 4

Exploratory Data Analysis

Nutthaporn Amnuayporn

December 11, 2017