EDA on Complaints allegations at NYC

The objective of this kernel is to conduct an exploratory data analysis of “Complaints_Allegations” data set.

The dats set is pulled from the following : link. The data file has two tabs: one with metadata, and the “Complaints_Allegations” tab with the actual data. We will be exploring the data from NYC Data Transparency Initiative wherein they maintain a database of complaints that fall within the Civilian Complain Review Board (CCRB), an independent municiple agency. Our objective is to identify interesting patterns and trends within the data that may be indicative of large scale trends.

Loading the dataset and checking for descriptive statistics:

library(ggplot2)
library(ggthemes)
library(readxl)
ccrb<- read_excel("C:/Users/Kowshik Kumar B/Documents/HU/512/Problem Set 4/ccrb_datatransparencyinitiative.xlsx", sheet = "Complaints_Allegations")
head(ccrb)

## # A tibble: 6 x 16
##   DateStamp           UniqueComplaint~ `Close Year` `Received Year`
##   <dttm>                         <dbl>        <dbl>           <dbl>
## 1 2016-11-29 00:00:00               11         2006            2005
## 2 2016-11-29 00:00:00               18         2006            2004
## 3 2016-11-29 00:00:00               18         2006            2004
## 4 2016-11-29 00:00:00               18         2006            2004
## 5 2016-11-29 00:00:00               18         2006            2004
## 6 2016-11-29 00:00:00               18         2006            2004
## # ... with 12 more variables: `Borough of Occurrence` <chr>, `Is Full
## #   Investigation` <lgl>, `Complaint Has Video Evidence` <lgl>, `Complaint
## #   Filed Mode` <chr>, `Complaint Filed Place` <chr>, `Complaint Contains Stop
## #   & Frisk Allegations` <lgl>, `Incident Location` <chr>, `Incident
## #   Year` <dbl>, `Encounter Outcome` <chr>, `Reason For Initial Contact` <chr>,
## #   `Allegation FADO Type` <chr>, `Allegation Description` <chr>

tail(ccrb)

## # A tibble: 6 x 16
##   DateStamp           UniqueComplaint~ `Close Year` `Received Year`
##   <dttm>                         <dbl>        <dbl>           <dbl>
## 1 2016-11-29 00:00:00            69476         2016            2016
## 2 2016-11-29 00:00:00            69476         2016            2016
## 3 2016-11-29 00:00:00            69476         2016            2016
## 4 2016-11-29 00:00:00            69476         2016            2016
## 5 2016-11-29 00:00:00            69476         2016            2016
## 6 2016-11-29 00:00:00            69476         2016            2016
## # ... with 12 more variables: `Borough of Occurrence` <chr>, `Is Full
## #   Investigation` <lgl>, `Complaint Has Video Evidence` <lgl>, `Complaint
## #   Filed Mode` <chr>, `Complaint Filed Place` <chr>, `Complaint Contains Stop
## #   & Frisk Allegations` <lgl>, `Incident Location` <chr>, `Incident
## #   Year` <dbl>, `Encounter Outcome` <chr>, `Reason For Initial Contact` <chr>,
## #   `Allegation FADO Type` <chr>, `Allegation Description` <chr>

str(ccrb)

## tibble [204,397 x 16] (S3: tbl_df/tbl/data.frame)
##  $ DateStamp                                  : POSIXct[1:204397], format: "2016-11-29" "2016-11-29" ...
##  $ UniqueComplaintId                          : num [1:204397] 11 18 18 18 18 18 18 18 18 18 ...
##  $ Close Year                                 : num [1:204397] 2006 2006 2006 2006 2006 ...
##  $ Received Year                              : num [1:204397] 2005 2004 2004 2004 2004 ...
##  $ Borough of Occurrence                      : chr [1:204397] "Manhattan" "Brooklyn" "Brooklyn" "Brooklyn" ...
##  $ Is Full Investigation                      : logi [1:204397] FALSE TRUE TRUE TRUE TRUE TRUE ...
##  $ Complaint Has Video Evidence               : logi [1:204397] FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Complaint Filed Mode                       : chr [1:204397] "On-line website" "Phone" "Phone" "Phone" ...
##  $ Complaint Filed Place                      : chr [1:204397] "CCRB" "CCRB" "CCRB" "CCRB" ...
##  $ Complaint Contains Stop & Frisk Allegations: logi [1:204397] FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Incident Location                          : chr [1:204397] "Street/highway" "Street/highway" "Street/highway" "Street/highway" ...
##  $ Incident Year                              : num [1:204397] 2005 2004 2004 2004 2004 ...
##  $ Encounter Outcome                          : chr [1:204397] "No Arrest or Summons" "Arrest" "Arrest" "Arrest" ...
##  $ Reason For Initial Contact                 : chr [1:204397] "Other" "PD suspected C/V of violation/crime - street" "PD suspected C/V of violation/crime - street" "PD suspected C/V of violation/crime - street" ...
##  $ Allegation FADO Type                       : chr [1:204397] "Abuse of Authority" "Abuse of Authority" "Discourtesy" "Discourtesy" ...
##  $ Allegation Description                     : chr [1:204397] "Threat of arrest" "Refusal to obtain medical treatment" "Word" "Word" ...

summary(ccrb)

##    DateStamp          UniqueComplaintId   Close Year   Received Year 
##  Min.   :2016-11-29   Min.   :    1     Min.   :2006   Min.   :1999  
##  1st Qu.:2016-11-29   1st Qu.:17356     1st Qu.:2008   1st Qu.:2007  
##  Median :2016-11-29   Median :34794     Median :2010   Median :2009  
##  Mean   :2016-11-29   Mean   :34778     Mean   :2010   Mean   :2010  
##  3rd Qu.:2016-11-29   3rd Qu.:52204     3rd Qu.:2013   3rd Qu.:2012  
##  Max.   :2016-11-29   Max.   :69492     Max.   :2016   Max.   :2016  
##  Borough of Occurrence Is Full Investigation Complaint Has Video Evidence
##  Length:204397         Mode :logical         Mode :logical               
##  Class :character      FALSE:107084          FALSE:195530                
##  Mode  :character      TRUE :97313           TRUE :8867                  
##                                                                          
##                                                                          
##                                                                          
##  Complaint Filed Mode Complaint Filed Place
##  Length:204397        Length:204397        
##  Class :character     Class :character     
##  Mode  :character     Mode  :character     
##                                            
##                                            
##                                            
##  Complaint Contains Stop & Frisk Allegations Incident Location  Incident Year 
##  Mode :logical                               Length:204397      Min.   :1999  
##  FALSE:119856                                Class :character   1st Qu.:2007  
##  TRUE :84541                                 Mode  :character   Median :2009  
##                                                                 Mean   :2010  
##                                                                 3rd Qu.:2012  
##                                                                 Max.   :2016  
##  Encounter Outcome  Reason For Initial Contact Allegation FADO Type
##  Length:204397      Length:204397              Length:204397       
##  Class :character   Class :character           Class :character    
##  Mode  :character   Mode  :character           Mode  :character    
##                                                                    
##                                                                    
##                                                                    
##  Allegation Description
##  Length:204397         
##  Class :character      
##  Mode  :character      
##                        
##                        
##

Firstly, let us look at the data by simple frequncy plots to understand the trend of how the complaints are registered by different factors.

h1=ggplot(data = ccrb, mapping = aes(x = ccrb$`Complaint Filed Mode`)) +
  geom_bar(stat = "count") +
  labs(x = "Complaints Filed Mode") +
  ggtitle("Frequency of Complaints Filed") +
  theme(plot.title = element_text(hjust = 0.5))
h2=ggplot(data = ccrb, mapping = aes(x = ccrb$`Allegation FADO Type`)) +
  geom_bar(stat = "count") +
  labs(x = "Allegation FADO type") +
  ggtitle("Frequency of Allegations FADO type") +
  theme(plot.title = element_text(hjust = 0.5))
h3=ggplot(data = ccrb, mapping = aes(x = ccrb$`Borough of Occurrence`)) +
  geom_bar(stat = "count") +
  labs(x = "Complaints Filed Mode") +
  ggtitle("Frequency of Complaints occurred at particular Borough") +
  theme(plot.title = element_text(hjust = 0.5))
h4=ggplot(data = ccrb, mapping = aes(x = ccrb$`Complaint Has Video Evidence`)) +
  geom_bar(stat = "count") +
  labs(x = "Video Evidence") +
  ggtitle("Frequency of Complaints has video evidence") +
  theme(plot.title = element_text(hjust = 0.5))
h5=ggplot(data = ccrb, mapping = aes(x = ccrb$`Incident Location`)) +
  geom_bar(stat = "count") +
  labs(x = "Incident Location") +
  ggtitle("Frequency of Incident Location") +
  theme(plot.title = element_text(hjust = 0.5))
h6=ggplot(data = ccrb, mapping = aes(x = ccrb$`Encounter Outcome`)) +
  geom_bar(stat = "count") +
  labs(x = "Encounter Outcome") +
  ggtitle("Frequency of Complaint's Encounter Outcome") +
  theme(plot.title = element_text(hjust = 0.5))
h1

h2

h3

h4

h5

h6

From the above plots, we can clearly interpret that there are more of Complaints filed from Phones, Call Processing System, and On-line website In-person but less from Fax and Mail. There are more Complaints registered in the area of Brooklyn, Manhattan and Bronx in comparison to other Boroughs of NY. There are very few Complaints with Video evidences. There are more cases involving with Abuse of authority, Force and Discourtesy.There are more of Arrests, No arrests or Summons and Summons for the Encounter Outcome.

Next, let us dive into deep visualizations from sound understanding of the multi-variate activity by using ‘fill’.

ggplot(ccrb, aes(x=ccrb$`Borough of Occurrence`, fill=ccrb$`Allegation FADO Type`)) +
  geom_histogram(stat="count") + 
  labs(title="Histogram of Incident Occurence by Borough and Type", x="Borough", y="Frequency of Incident Occurence") + 
  guides(fill=guide_legend(title = "Type of allegation"))

In the above plot, we can clearly interpret the ranking of Incidents occurred by Boroughs wherein Brooklyn ranks first with more of Abuse of authority type of allegation cases while others like Bronx and Manhattan follow next with the same pattern with respect to type of allegations.

ggplot(ccrb, aes(x=ccrb$`Borough of Occurrence`, fill=ccrb$`Complaint Has Video Evidence`)) +
  geom_histogram(stat="count") + 
  labs(title="Histogram of Complaints with video evidence by Boroughs", x="Boroughs", y="Frequency of Complaints") + 
  guides(fill=guide_legend(title = "Video Evidence"))

In the above plot, we can clearly visualize that there are very few Complaints which the Boroughs have video evidences.

ggplot(ccrb, aes(x=ccrb$`Complaint Filed Mode`, fill=ccrb$`Is Full Investigation`)) +
  geom_histogram(stat="count") + 
  labs(title="Histogram of Complaints filed mode with full Investigations", x="Complaints filed mode", y="Frequency of Investigations")+
  guides(fill=guide_legend(title = "Full Investigation"))

In the above plot, we can clearly see that most of the Complaints filed are by phone, call processing system, on-line website and In-person. Wherein less than 50% cases are performed full investigations if the complaints are filed by phone, call-processing system and On-line websites. However, if the complaints are filed by In-person or by Mail then they are performed with more than 60% cases.

p1=ggplot(ccrb, aes(x=ccrb$`Incident Year`, fill=ccrb$`Complaint Has Video Evidence`)) +
  geom_histogram(stat="count") + 
  labs(title="Histogram of Incident Year for Complaints with Video Evidence ", x="year", y="Frequency") + 
  guides(fill=guide_legend(title = "Video Evidence"))
p2=ggplot(ccrb, aes(x=ccrb$`Incident Location`, fill=ccrb$`Complaint Has Video Evidence`)) +
  geom_histogram(stat="count") + 
  labs(title="Histogram of Incident Location for Complaints with Video Evidence ", x="Incident Location", y="Frequency") + 
  guides(fill=guide_legend(title = "Video Evidence")) 
p1

p2

In the above plot, we can clearly visualize that there are very few Boroughs with video evidences and moreover they are monitored after 2010.

ggplot(ccrb, aes(x=ccrb$`Close Year`, fill=ccrb$`Is Full Investigation`)) +
  geom_histogram(stat="count") + 
  labs(title="Histogram of Investigations closed by Year", x="Year Closed", y="Frequency") + 
  guides(fill=guide_legend(title = "Full Investigation"))

In the above plot, we can clearly see that the trend is similar for most of the full investigations for the year of cases closed.

ggplot(ccrb, aes(x=ccrb$`Encounter Outcome`, fill=ccrb$`Borough of Occurrence`)) +
  geom_histogram(stat="count") + 
  labs(title="Histogram of Encounter Outcome with Borough Occurred ", x="Encounter Outcome", y="Frequency") + 
  guides(fill=guide_legend(title = "Boroughs"))

In the above plot, we can clearly visualize that the different Encounter outcomes with respect with the Boroughs which show important inferences how which Boroughs are contributing for most of the cast Encounter outcomes.

ggplot(ccrb, aes(x=ccrb$`Incident Year`, fill=ccrb$`Encounter Outcome`)) +
  geom_histogram(stat="count") + 
  labs(title="Histogram of Incident year Occurred by encounter outcome", x="Incident year", y="Frequency") + 
  guides(fill=guide_legend(title = "Encounter Outcome"))

In the above plot, we can clearly visualize how the Encounter Outcomes have been in between 2005 and 2015.

ggplot(ccrb, mapping = aes(ccrb$`Incident Year`,ccrb$`Received Year`)) +
    geom_point()+
  ggtitle("Relationship between Incident year and Received year") +
    xlab("Incident year")+
    ylab("Received year")+
    geom_smooth(method = 'lm', color = 'blue')+
  theme_minimal()

ggplot(ccrb, mapping = aes(ccrb$`Close Year`,ccrb$`Received Year`)) +
    geom_point()+
  ggtitle("Relationship between Incident closed year and Received year") +
    xlab("Closed year")+
    ylab("Received year")+
    geom_smooth(method = 'lm', color = 'blue')+
  theme_minimal()

In the above two plots, we have performed scatterplots with linear method wherein we could see the linear trend for both relationships of Closed year and Received year alongside Incident year and Received year.

Summary:

To summarize, the type of data obtained from the databases of complaints that fall within the Civilian Complain Review Board (CCRB) from NYC Data Transparency is Categorical type which is of majority. If we directly look into the xlsx file of the dataset then we will have to ponder for any clues for what is happening with the data. However, only after performing visualizations to the dataset’s variable we could see the trends and could clearly interpret the data. Hence, we can clearly conclude that the key to Data analysis is Data Vis.

EDA on Complaints allegations at NYC

Exploratory Data Analysis

Kowshik Kumar Bandameedipalli

2020-06-12