Exploratory Data Analysis


Charles Palmer

Between 1999 and 2016, 204,397 complaints and allegations of excessive force, abuse of authority, discourtesy, or offensive language were reported across the five boroughs of New York City. These allegations and complaints were reported to and collected by the NYC Civilian Complaint Review Board (CCRB).

For this exploratory data analysis, I present 10 data visualizations that summarize a few of the primary characteristics of the dataset.


summary(ccrb_raw)
##   DateStamp         UniqueComplaintId   Close Year   Received Year 
##  Length:204397      Min.   :    1     Min.   :2006   Min.   :1999  
##  Class :character   1st Qu.:17356     1st Qu.:2008   1st Qu.:2007  
##  Mode  :character   Median :34794     Median :2010   Median :2009  
##                     Mean   :34778     Mean   :2010   Mean   :2010  
##                     3rd Qu.:52204     3rd Qu.:2013   3rd Qu.:2012  
##                     Max.   :69492     Max.   :2016   Max.   :2016  
##  Borough of Occurrence Is Full Investigation Complaint Has Video Evidence
##  Length:204397         Mode :logical         Mode :logical               
##  Class :character      FALSE:107084          FALSE:195530                
##  Mode  :character      TRUE :97313           TRUE :8867                  
##                        NA's :0               NA's :0                     
##                                                                          
##                                                                          
##  Complaint Filed Mode Complaint Filed Place
##  Length:204397        Length:204397        
##  Class :character     Class :character     
##  Mode  :character     Mode  :character     
##                                            
##                                            
##                                            
##  Complaint Contains Stop & Frisk Allegations Incident Location 
##  Mode :logical                               Length:204397     
##  FALSE:119856                                Class :character  
##  TRUE :84541                                 Mode  :character  
##  NA's :0                                                       
##                                                                
##                                                                
##  Incident Year  Encounter Outcome  Reason For Initial Contact
##  Min.   :1999   Length:204397      Length:204397             
##  1st Qu.:2007   Class :character   Class :character          
##  Median :2009   Mode  :character   Mode  :character          
##  Mean   :2010                                                
##  3rd Qu.:2012                                                
##  Max.   :2016                                                
##  Allegation FADO Type Allegation Description
##  Length:204397        Length:204397         
##  Class :character     Class :character      
##  Mode  :character     Mode  :character      
##                                             
##                                             
## 
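The plotting code that follows refers to four data frames (`ccrb_boros`, `ccrb_2010`, `ccrb_2004`, and `ccrb_frisk`) whose construction is not shown here. The sketch below is one way they could be derived from `ccrb_raw`. The filter conditions are assumptions inferred from how each visualization is described, and the small `ccrb_raw` stand-in exists only so the example runs on its own.

```r
# Toy stand-in for ccrb_raw (hypothetical rows; the real data has 204,397).
# check.names = FALSE keeps the column names with spaces, as in the summary.
ccrb_raw <- data.frame(
  "Received Year" = c(2003, 2004, 2010, 2010, 2014, 2016),
  "Borough of Occurrence" = c("Bronx", "Brooklyn", "Queens",
                              "Outside NYC", NA, "Manhattan"),
  "Complaint Contains Stop & Frisk Allegations" =
    c(FALSE, TRUE, TRUE, FALSE, TRUE, FALSE),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

five_boroughs <- c("Bronx", "Brooklyn", "Manhattan", "Queens", "Staten Island")

# Assumed filters. %in% returns FALSE for NA, so this drops both
# 'Outside NYC' and rows with a missing borough.
ccrb_boros <- ccrb_raw[ccrb_raw$`Borough of Occurrence` %in% five_boroughs, ]
ccrb_2010  <- ccrb_boros[ccrb_boros$`Received Year` == 2010, ]
ccrb_2004  <- ccrb_boros[ccrb_boros$`Received Year` >= 2004, ]
ccrb_frisk <- ccrb_boros[ccrb_boros$`Complaint Contains Stop & Frisk Allegations`, ]
```

Whether the year-based subsets were built from `ccrb_boros` or directly from `ccrb_raw` is not stated, so that choice is also an assumption.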

Vis 1: Incident Reports
A histogram of the complaints received across the five boroughs of New York City between 1999 and 2016. This image shows two things: 1) incident reports have steadily declined since 2007, and 2) although the incidents occurred between 1999 and 2016, they weren’t officially reported until 2004.

hist(ccrb_boros$`Received Year`, main="#1 Histogram for Received Year", xlab="Received Year", border="black", breaks=18, col="green", las=3)

Vis 2: Summary of all Incidents by Borough in 2010
The original data provided by the Civilian Complaint Review Board included complaints from the five boroughs, plus two further categories, ‘Outside NYC’ and ‘NA’. For this visualization, Outside NYC and NA have been removed to focus on complaints of known origin. I am also presenting only the 2010 reports. The goal was to compare this data against the 2010 Census report to see whether the density of reporting matches the borough populations. Unfortunately, I could not layer this additional information on the bar chart, so I have included the census data (as a percentage of the region) on a separate chart.

It is clear that Brooklyn has the most complaints, which makes sense because the borough has 30.6% of the population. Surprisingly, Queens, at 27.3% of the population, reports far fewer complaints than its population share would suggest.

p <- ggplot(ccrb_2010, aes(x = `Borough of Occurrence`)) + geom_bar(stat="count") + labs(title = "#2 2010 Complaints and Allegations", x = "Boroughs", y = "", subtitle = "An accounting of all allegations by NYC Borough", caption = "(Source: Civilian Complaint Review Board (CCRB))") + theme_light()
p

dat1 <- data.frame(census = c(16.9, 30.6, 19.4, 27.3, 5.7), boro = factor(c("Bronx","Brooklyn","Manhattan","Queens","Staten Island"))
)

# group = 1 joins the points across the discrete x axis; without it
# geom_line() draws nothing and warns that each group has one observation.
census_plot <- ggplot(data=dat1, aes(x=boro, y=census, group=1)) +
    geom_line() +
    geom_point()

census_plot
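
The comparison described for Vis 2 can also be made numerically, by putting each borough's share of complaints next to its share of the population. The sketch below uses the census percentages from `dat1`. The complaint counts are made-up placeholders, not values from the CCRB data; in practice they would come from a `table()` of the borough column in `ccrb_2010`.

```r
# Census shares (percent), as given in dat1.
census <- c(Bronx = 16.9, Brooklyn = 30.6, Manhattan = 19.4,
            Queens = 27.3, `Staten Island` = 5.7)

# Placeholder complaint counts, for illustration only (NOT real CCRB figures).
complaints <- c(Bronx = 250, Brooklyn = 350, Manhattan = 250,
                Queens = 120, `Staten Island` = 30)

# Convert counts to percentage shares.
complaint_share <- round(100 * complaints / sum(complaints), 1)

# Align both vectors by borough name before combining them.
comparison <- data.frame(
  borough       = names(census),
  census_pct    = unname(census),
  complaint_pct = unname(complaint_share[names(census)]),
  stringsAsFactors = FALSE
)

# Positive values: more complaints than the population share would suggest.
comparison$difference <- comparison$complaint_pct - comparison$census_pct
print(comparison)
```

A table like this gives the "higher or lower than expected" reading directly, without needing to overlay the census data on the bar chart.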

Vis 3: Filing Mode
Direct telephone calls have been the primary source of complaints since 2004. Given the number of cellphones in use since 2008, I am not surprised by these results.

p <- ggplot(ccrb_2004, aes(x = `Complaint Filed Mode`)) + geom_bar() + labs(title = "#3 Filing Mode", x = "Filing mode", y = "Number of reports", subtitle = "A comparison of how complaints were filed", caption = "(Source: Civilian Complaint Review Board (CCRB))")
p

Vis 4: Filing modes over time
Where Vis 3 showed that most complaints have been received by phone, this stacked bar chart highlights that the reporting of incidents has been declining over the last decade. Although this image does not provide any new insights, it does reinforce the results identified in Vis 1 & 3. The addition of color helps illustrate the point.

p <- ggplot(data = ccrb_2004, aes(x = `Received Year`)) + geom_bar(aes(fill = `Complaint Filed Mode`)) + scale_fill_discrete(name = "Reporting type") + labs(title = "#4 Filing Mode over time", x = "Year Complaint Received", y = "Report count", subtitle = "A breakdown of how complaints were filed", caption = "(Source: Civilian Complaint Review Board (CCRB))") + theme_light()
p


Vis 5: Summary of Stop and Frisk Allegations by Borough
Stop and frisk has been a very controversial issue over the years. When compared with Vis 2, this chart shows that the occurrence of these types of reports is consistent with the overall frequency of all reporting. Surprisingly, the decline is gradual: I would have expected a steeper drop-off in these incidents, given that the Stop and Frisk policy was deemed unconstitutional by a federal judge in August of 2013. But as the chart shows, the practice was still being employed.

p <- ggplot(ccrb_frisk, aes(x = `Received Year`)) + geom_bar() + labs(title = "#5 Stop & Frisk Allegations", x = "Year Complaint Received", y = "Report count", subtitle = "A comparison of Stop & Frisk complaints over time", caption = "(Source: Civilian Complaint Review Board (CCRB))") + theme_light()
p
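
Raw counts fall whenever total reporting falls, so one way to test the expectation of a steeper drop after 2013 is to look at the share of each year's reports that contain a Stop & Frisk allegation. A minimal sketch, using a few invented rows in place of the real data (the column names follow the summary of `ccrb_raw`):

```r
# Invented rows for illustration only.
toy <- data.frame(
  "Received Year" = c(2012, 2012, 2012, 2012, 2014, 2014, 2014, 2014),
  "Complaint Contains Stop & Frisk Allegations" =
    c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE),
  check.names = FALSE
)

# The mean of a logical vector is the proportion of TRUE values,
# so this gives the Stop & Frisk share of reports for each year.
frisk_share <- tapply(
  toy$`Complaint Contains Stop & Frisk Allegations`,
  toy$`Received Year`,
  mean
)
print(frisk_share)
```

Applied to the full dataset, a falling share after 2013 would point to a change in practice, while a flat share would suggest the decline only mirrors overall reporting.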

Vis 6: Allegations by Borough
This visualization illustrates the breakdown of various allegation types for each borough. It shows that ‘Abuse of Authority’ represents the bulk of complaints from all five boroughs. Personally, I was surprised by the consistency of the complaints across the region.

p <- ggplot(ccrb_boros, aes(x = `Borough of Occurrence`, fill = `Allegation FADO Type`)) + geom_bar(position = "fill") + labs(title = "#6 Allegations by Borough", x = "Boroughs", y = "Proportion") + scale_fill_discrete(name = "Allegation Types")
p
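
A `position = "fill"` chart rescales each borough's bar to 1, so the figures behind it are row proportions. The sketch below reproduces that calculation with `prop.table()` on a small invented sample. The counts are placeholders rather than CCRB values, and the type labels other than ‘Abuse of Authority’ are assumed.

```r
# Invented sample for illustration only.
toy <- data.frame(
  borough = c("Bronx", "Bronx", "Bronx", "Bronx", "Queens", "Queens"),
  fado    = c("Abuse of Authority", "Abuse of Authority", "Force",
              "Discourtesy", "Abuse of Authority", "Force")
)

# Cross-tabulate: one row per borough, one column per allegation type.
counts <- table(toy$borough, toy$fado)

# margin = 1 normalises within each row (borough), so every row sums to 1.
shares <- prop.table(counts, margin = 1)
print(round(shares, 2))
```

Reading the proportions as numbers makes it easier to judge how consistent the boroughs really are than comparing bar segments by eye.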

Vis 7: Presence of video evidence
With the proliferation of cellphones, 2011-2015 saw a rise in the release of supporting video evidence. Although 2016 saw a reduction in this evidence, continued study would be needed to determine whether this is a new trend. It should also be noted that in April of 2017, the first group of NYC police officers was issued “body cams”. With these devices in place we should see a marked increase in video evidence in the coming years.

p <- ggplot(data = ccrb_2004, aes(x = as.factor(`Received Year`))) + geom_bar(aes(fill = `Complaint Has Video Evidence`)) + labs(title = "#7 Presence of Video Evidence",x = "Year Complaint Received", y="") + scale_fill_manual(name = "Video evidence", values=c("#999999", "#E69F00"))
p

Vis 8: Encounter outcome
This bar chart summarizes the outcomes of the 203,744 reports from the five boroughs.

p <- ggplot(ccrb_boros, aes(x = `Encounter Outcome`)) + geom_bar() + labs(title = "#8 Encounter outcome",x = "", y="", subtitle = "An illustration of the encounter outcomes")
p

Vis 9: Incident Locations
Another summary, this one indicating where the incidents took place. Here we see that an incident is more likely to be reported on a public street or highway than anywhere else.

# geom_bar() counts rows per category; geom_histogram(stat="count") is meant for continuous data and triggers an "unknown parameters" warning here.
ggplot(ccrb_boros, aes(x = `Incident Location`)) + geom_bar() + coord_flip() + labs(title = "#9 Incident Locations", x = "", y="Incidents per location", subtitle = "A general description of the incident location")

Vis 10: Incidents by Borough of Occurrence
And finally, here is a box-plot comparing the incidents per year per borough.

p <- ggplot(ccrb_boros, aes(y = `Incident Year`, x = `Borough of Occurrence`)) + geom_boxplot() + labs(title = "#10 Incidents across the Boroughs", y = "Incident Years", x = "New York Boroughs") + coord_flip()
p
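
The box in each box plot spans the first to third quartile of `Incident Year`, with the median marked inside it. Assuming the same column names, those numbers can be listed per borough with `quantile()`; the rows below are invented for illustration.

```r
# Invented rows for illustration only.
toy <- data.frame(
  "Incident Year" = c(2005, 2007, 2009, 2011, 2013,
                      2006, 2008, 2010, 2012, 2014),
  "Borough of Occurrence" = rep(c("Bronx", "Queens"), each = 5),
  check.names = FALSE
)

# Split the years by borough, then compute the minimum, quartiles, and
# maximum for each group. sapply() returns one column per borough.
year_summary <- sapply(
  split(toy$`Incident Year`, toy$`Borough of Occurrence`),
  quantile
)
print(year_summary)
```

Similar quartiles across boroughs would confirm numerically what the side-by-side boxes suggest visually.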

Summary

Exploratory data analysis provides a quick way to gain insights, find clues, and uncover misconceptions about a dataset. It is considered a precursor to deeper analysis, because a quick look usually helps to identify areas of concern or focus. From the images created here we can see a number of insights: