Objectives

The objective of this assignment is to conduct an exploratory data analysis of a data set that you are not familiar with. In this week's lecture we discussed a number of visualization approaches to exploring a data set; this assignment will apply those tools and techniques. An important distinction between class examples and applied data science work is the iterative and repetitive nature of exploring a data set. It takes time to understand what is in the data and what is interesting in it.

For this week we will be exploring data from the NYC Data Transparency Initiative. They maintain a database of complaints that fall within the purview of the Civilian Complaint Review Board (CCRB), an independent municipal agency. Your objective is to identify interesting patterns and trends within the data that may be indicative of large-scale trends.

This link will allow you to download the data set in .xlsx format. The data file has two tabs: one with metadata, and the “Complaints_Allegations” tab with the actual data.

Deliverable and Grades

For this assignment you should submit an RPubs link to a knitr-rendered HTML document that shows your exploratory data analysis. Organize your analysis using meaningful section headings:

# This is a top section

## This is a subsection

Your final document should include at minimum 10 visualizations. Each should include a brief statement of what it shows.

A final section should summarize what you learned from your EDA. Your grade will be based on the quality of your graphics and the sophistication of your findings.

library(readxl)
df <- read_excel("/Volumes/PNY 16 GB/HU Coursework/ANLY 512/HW 5/ccrb_datatransparencyinitiative.xlsx", sheet = "Complaints_Allegations")
head(df)

## # A tibble: 6 x 16
##    DateStamp UniqueComplaintId `Close Year` `Received Year`
##       <dttm>             <dbl>        <dbl>           <dbl>
## 1 2016-11-29                11         2006            2005
## 2 2016-11-29                18         2006            2004
## 3 2016-11-29                18         2006            2004
## 4 2016-11-29                18         2006            2004
## 5 2016-11-29                18         2006            2004
## 6 2016-11-29                18         2006            2004
## # ... with 12 more variables: `Borough of Occurrence` <chr>, `Is Full
## #   Investigation` <lgl>, `Complaint Has Video Evidence` <lgl>, `Complaint
## #   Filed Mode` <chr>, `Complaint Filed Place` <chr>, `Complaint Contains
## #   Stop & Frisk Allegations` <lgl>, `Incident Location` <chr>, `Incident
## #   Year` <dbl>, `Encounter Outcome` <chr>, `Reason For Initial
## #   Contact` <chr>, `Allegation FADO Type` <chr>, `Allegation
## #   Description` <chr>

str(df)

## Classes 'tbl_df', 'tbl' and 'data.frame':    204397 obs. of  16 variables:
##  $ DateStamp                                  : POSIXct, format: "2016-11-29" "2016-11-29" ...
##  $ UniqueComplaintId                          : num  11 18 18 18 18 18 18 18 18 18 ...
##  $ Close Year                                 : num  2006 2006 2006 2006 2006 ...
##  $ Received Year                              : num  2005 2004 2004 2004 2004 ...
##  $ Borough of Occurrence                      : chr  "Manhattan" "Brooklyn" "Brooklyn" "Brooklyn" ...
##  $ Is Full Investigation                      : logi  FALSE TRUE TRUE TRUE TRUE TRUE ...
##  $ Complaint Has Video Evidence               : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Complaint Filed Mode                       : chr  "On-line website" "Phone" "Phone" "Phone" ...
##  $ Complaint Filed Place                      : chr  "CCRB" "CCRB" "CCRB" "CCRB" ...
##  $ Complaint Contains Stop & Frisk Allegations: logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Incident Location                          : chr  "Street/highway" "Street/highway" "Street/highway" "Street/highway" ...
##  $ Incident Year                              : num  2005 2004 2004 2004 2004 ...
##  $ Encounter Outcome                          : chr  "No Arrest or Summons" "Arrest" "Arrest" "Arrest" ...
##  $ Reason For Initial Contact                 : chr  "Other" "PD suspected C/V of violation/crime - street" "PD suspected C/V of violation/crime - street" "PD suspected C/V of violation/crime - street" ...
##  $ Allegation FADO Type                       : chr  "Abuse of Authority" "Abuse of Authority" "Discourtesy" "Discourtesy" ...
##  $ Allegation Description                     : chr  "Threat of arrest" "Refusal to obtain medical treatment" "Word" "Word" ...

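library(tidyverse)
library(ggplot2)
# Convert the spaced column names from read_excel() to the dotted names used below
names(df) <- make.names(names(df))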

Vis 1. Pie chart of the number of incidents per borough

p_pie <- ggplot(data = df) + 
  geom_bar(mapping = aes(x = factor(1), fill = Borough.of.Occurrence), width = 1) + 
  coord_polar(theta = "y") 
p_pie

The pie chart above indicates that, among the boroughs of New York, the highest number of incidents occurred in Brooklyn, followed by the Bronx, Manhattan and Queens. Staten Island was the borough with the fewest incidents.
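The ranking read off the pie chart can also be checked numerically. The following is a minimal sketch, assuming the dotted column names used in the plotting code; `toy` is a small made-up data frame standing in for `df` (it is not the CCRB data), and `borough_counts` is a hypothetical name.

```r
library(dplyr)

# Hypothetical stand-in for df; the values are made up for illustration only.
toy <- tibble(
  Borough.of.Occurrence = c("Brooklyn", "Brooklyn", "Brooklyn",
                            "Bronx", "Bronx", "Staten Island")
)

# Count rows per borough (largest first) and add each borough's share of the total.
borough_counts <- toy %>%
  count(Borough.of.Occurrence, sort = TRUE) %>%
  mutate(share = n / sum(n))

borough_counts
```

Replacing `toy` with `df` would give the real counts. Because the same `UniqueComplaintId` appears on several rows of the `head(df)` output, each row appears to be one allegation rather than one complaint, so these would be allegation counts.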

Vis 2. Number of incidents per borough

p3 <- ggplot(data = df, aes(x = Borough.of.Occurrence)) +
  geom_bar() + 
  theme_classic() +
  labs(title = "Number of Incidents in each Borough") + 
  theme(plot.title = element_text(size = 18, color = "red", face = "bold" ,hjust = 0.5, family = "American Typewriter"), 
        axis.text.x = element_text(size = 10, vjust = 0.5, angle = 60))
p3

The bar plot above shows the same information as the pie chart; it was created as a base to which further ggplot2 layers can be added to extract more information from the data.

Vis 3. Allegation FADO type in each borough

p4 <- ggplot(data = df, aes(x = Borough.of.Occurrence, fill = Allegation.FADO.Type)) +
  geom_bar() + 
  theme_classic() +
  labs(title = "Number of Incidents in each Borough by Type") + 
  theme(plot.title = element_text(size = 16, color = "red", face = "bold", family = "American Typewriter"), 
        axis.text.x = element_text(size = 10, vjust = 0.5, angle = 60))
p4

The bar plot above shows the number of incidents in each borough by FADO allegation type. At first glance, Abuse of Authority was the most commonly reported allegation type across the five boroughs of NYC, followed by Force. Offensive Language was the least common. Furthermore, the proportions of the different FADO allegation types do not seem to vary much between boroughs.
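Proportions are hard to compare in a stacked count plot because the bars have different heights. One way to compare them directly is `position = "fill"`, sketched below on a small made-up data frame (`toy` and `p_share` are hypothetical names, and the values are not from the CCRB data).

```r
library(ggplot2)

# Hypothetical stand-in for df; the values are made up for illustration only.
toy <- data.frame(
  Borough.of.Occurrence = c("Brooklyn", "Brooklyn", "Brooklyn", "Brooklyn",
                            "Bronx", "Bronx"),
  Allegation.FADO.Type  = c("Abuse of Authority", "Abuse of Authority",
                            "Force", "Discourtesy",
                            "Abuse of Authority", "Force")
)

# position = "fill" rescales every bar to height 1, so the segments show
# within-borough proportions instead of raw counts.
p_share <- ggplot(toy, aes(x = Borough.of.Occurrence, fill = Allegation.FADO.Type)) +
  geom_bar(position = "fill") +
  labs(y = "Share of allegations")
p_share
```

If the segments line up at similar heights across boroughs, the mix of allegation types is similar regardless of how many allegations each borough has.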

Vis 4. Location of incidents by region (borough)

p5 <- ggplot(data = df, aes(x = Borough.of.Occurrence, fill = Incident.Location)) +
  geom_bar() + 
  theme_classic() +
  labs(title = "Number of Incidents in each Borough by Location") + 
  theme(plot.title = element_text(size = 16, color = "red", face = "bold", family = "American Typewriter"), 
        axis.text.x = element_text(size = 10, vjust = 0.5, angle = 60), 
        legend.title = element_blank())
p5

Vis 4 further breaks the counts down by where the incident happened. The bar plot reveals that the largest number of incidents happened on a street/highway, followed by an apartment/house, a pattern that holds across all boroughs of NYC.

Vis 5. Allegation of FADO type by region (Borough)

library(cowplot)
# p13 is the heatmap of the same counts; its definition is not shown in this document
plot_grid(p4, p13, align = "v", ncol = 1, labels = c("Barplot", "Heatmap"))

The bar plot and the heatmap above show the same thing from visually different perspectives. The bar plot makes clear that Abuse of Authority was more predominant in Brooklyn than in the other boroughs; the heatmap represents the same pattern, where the lighter the color, the higher the number of incidents, and vice versa.
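The code that builds the heatmap object is not shown in this document. The sketch below is one way such a heatmap could be produced with `geom_tile()`; it is an assumption, not the original definition, and `toy`, `tile_counts` and `p_heat` are hypothetical names with made-up values.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for df; the values are made up for illustration only.
toy <- tibble(
  Borough.of.Occurrence = c("Brooklyn", "Brooklyn", "Brooklyn", "Brooklyn",
                            "Bronx", "Bronx"),
  Allegation.FADO.Type  = c("Abuse of Authority", "Abuse of Authority",
                            "Force", "Discourtesy",
                            "Abuse of Authority", "Force")
)

# One row per borough / FADO type combination, with its count in column n.
tile_counts <- toy %>%
  count(Borough.of.Occurrence, Allegation.FADO.Type)

# Each tile is one combination; the fill color encodes the count.
p_heat <- ggplot(tile_counts,
                 aes(x = Borough.of.Occurrence, y = Allegation.FADO.Type, fill = n)) +
  geom_tile() +
  labs(fill = "Allegations")
p_heat
```

With ggplot2's default continuous fill scale, larger counts are drawn in lighter blue, which matches the reading described above.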

Vis 6. Allegation of FADO type per region (Borough). (separate barplots for each region)

# geom_bar() counts rows per category; geom_histogram() is meant for continuous x
p9 <- ggplot(df, aes(x = Allegation.FADO.Type, fill = Allegation.FADO.Type)) + 
  geom_bar() + 
  facet_wrap(~ Borough.of.Occurrence) + 
  labs(title = "Allegation FADO Type Per Region") + 
  theme_gray() +
  theme(axis.text.x = element_text(size =8, vjust = 0.5, angle = 75), 
        plot.title = element_text(size = 10, color = "red", face = "bold" ,hjust = 0.5, family = "American Typewriter")) 
p9

The bar plots of FADO allegation types, plotted separately for each borough of NYC, support the claim that Abuse of Authority allegations are most common in Brooklyn, followed by the Bronx, Manhattan and Queens.

Vis 7. Outcome of each FADO type allegation (separate barplots for each FADO)

# geom_bar() counts rows per category; geom_histogram() is meant for continuous x
p11 <- ggplot(df, aes(x = Encounter.Outcome, fill = Encounter.Outcome)) + 
  geom_bar() + 
  facet_wrap(~ Allegation.FADO.Type) + 
  labs(title = "Outcome for Each FADO Type Allegation") +
  theme_gray() +
  theme(axis.text.x = element_text(size = 8, vjust = 0.5, angle = 75), 
        plot.title = element_text(size = 10, hjust = 0.5, color = "red", face = "bold", family = "American Typewriter")) 
p11

The bar plot above shows the encounter outcome for each type of FADO allegation. The largest number of arrests is associated with Force allegations, closely followed by Abuse of Authority. Abuse of Authority allegations also account for the largest number of encounters that ended without an arrest or summons.

Vis 8. Outcome of each FADO type allegation per region (Borough)

p14 <- ggplot(df, aes(x = Encounter.Outcome, fill = Encounter.Outcome)) + 
  geom_bar() + 
  facet_wrap(~ Borough.of.Occurrence) + 
  labs(title = "Outcome of Each FADO Type Allegation Per Region") +
  theme_gray() +
  theme(axis.text.x = element_text(size =8, vjust = 0.5, angle = 75), 
        plot.title = element_text(size = 10, hjust = 0.5, color = "red", face = "bold", family = "American Typewriter")) 
p14

Comparing the bar plot above with Vis 6, we can conclude that Brooklyn had the largest number of arrests as well as the largest number of incidents. It is also noteworthy that Brooklyn had the largest number of encounters that ended without any arrest or summons compared to the other regions.

Vis 9. Allegation FADO type broken down by encounter outcome

p10 <- ggplot(data = df, aes(x = Allegation.FADO.Type, fill = Encounter.Outcome)) +
  geom_bar() + 
  theme_classic() +
  labs(title = "Number of Incidents per FADO Type by Encounter Outcome") + 
  theme(plot.title = element_text(size = 16, color = "red", face = "bold", family = "American Typewriter"), 
        axis.text.x = element_text(size = 10, vjust = 0.5, angle = 60))
p10

The bar plot above again shows that more arrests are associated with Force allegations than with any other FADO type. Moreover, Abuse of Authority is the allegation type with the largest number of encounters that ended without an arrest or summons.

Vis 10. Outcome of each FADO type allegation per region: Density Plot (borough)

p12 <- ggplot(df, aes(x= Borough.of.Occurrence, color = Allegation.FADO.Type)) + 
  geom_density() + 
  labs(title = "Allegation FADO type per region") +
  theme(axis.text.x = element_text(size =10, vjust = 0.5, angle = 60), 
        plot.title = element_text(size = 10, color = "red", face = "bold" ,hjust = 0.5, family = "American Typewriter")) 
p12

The density plot above shows the same information as some of the bar plots above; however, it is harder to read than the bar plots, since density estimates are designed for continuous variables and borough is categorical.

Vis 11. Encounter outcome per incident year

p1 <- ggplot(data = df, aes(x=Incident.Year, fill = Encounter.Outcome)) + 
  geom_bar() + 
  theme_classic() +
  labs(title = "Encounter Outcome Per Year") + 
  scale_x_continuous(breaks = seq(1998, 2017, 1)) +
  theme(plot.title = element_text(size = 16, color = "red", face = "bold" ,hjust = 0.5, family = "American Typewriter"), 
        axis.text.x = element_text(size = 10, vjust = 0.5, angle = 60), 
        legend.position = "bottom", legend.box = "horizontal")
p1

There seems to be no clear trend in the mix of encounter outcomes from one year to another.

Vis 12. Video Evidence per year

p7 <- ggplot(data = df, aes(x = Incident.Year, fill = Complaint.Has.Video.Evidence)) +
  geom_bar() + 
  theme_classic() +
  labs(title = "Video Evidence and Incident Year") + 
  scale_x_continuous(breaks = seq(1998, 2017, 1)) +
  theme(plot.title = element_text(size = 16, color = "red", face = "bold", family = "American Typewriter"), 
        axis.text.x = element_text(size = 10, vjust = 0.5, angle = 60))
p7

cor(df$Incident.Year, df$Complaint.Has.Video.Evidence)

## [1] 0.2993368

The bar graph above shows how many complaints had video evidence in each incident year. Starting in 2010, the number of complaints backed by video evidence kept increasing year after year. The year 2016 appears to show less video evidence than the year before, which can be explained by the lower overall number of complaints in 2016 compared to 2015. The visualization is further supported by a positive correlation coefficient of 0.3 between incident year and whether a complaint had video evidence.
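Because the overall number of rows changes from year to year, raw counts can hide or exaggerate the trend. A minimal sketch of computing the yearly share with video evidence instead, on a small made-up data frame (`toy` and `video_share` are hypothetical names; the values are not from the CCRB data):

```r
library(dplyr)

# Hypothetical stand-in for df; the values are made up for illustration only.
toy <- tibble(
  Incident.Year = c(2009, 2009, 2012, 2012, 2015, 2015, 2015, 2015),
  Complaint.Has.Video.Evidence = c(FALSE, FALSE, FALSE, TRUE,
                                   TRUE, TRUE, TRUE, FALSE)
)

# The mean of a logical vector is the proportion of TRUE values,
# so share_video is the fraction of rows with video evidence in each year.
video_share <- toy %>%
  group_by(Incident.Year) %>%
  summarise(n_rows = n(),
            share_video = mean(Complaint.Has.Video.Evidence))

video_share
```

A share per year is not affected by a drop in total volume such as the one noted for 2016, so it separates "more video evidence" from "more complaints".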

Vis 13. Full investigation per year

p6 <- ggplot(data = df, aes(x = Incident.Year, fill = Is.Full.Investigation)) +
  geom_bar() + 
  theme_classic() +
  labs(title = "Completeness of Investigation Per Year") + 
  scale_x_continuous(breaks = seq(1998, 2017, 1)) +
  theme(plot.title = element_text(size = 18, color = "red", face = "bold" ,hjust = 0.5, family = "American Typewriter"), 
        axis.text.x = element_text(size = 10, vjust = 0.5, angle = 60))
p6

cor(df$Is.Full.Investigation, df$Complaint.Has.Video.Evidence)

## [1] 0.1527469

To find out whether having video evidence had an impact on the completion of an investigation, the bar plot above shows that, although the total number of complaints was lower after 2010, more of the complaints received a full investigation. This is further supported by a positive correlation coefficient of 0.15 between whether a complaint was fully investigated and whether it had video evidence.
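A correlation between two TRUE/FALSE columns can be easier to interpret as a cross-tabulation. The sketch below uses base R on a small made-up data frame (`toy`, `tab` and `row_share` are hypothetical names; the values are not from the CCRB data).

```r
# Hypothetical stand-in for df; the values are made up for illustration only.
toy <- data.frame(
  Is.Full.Investigation        = c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE),
  Complaint.Has.Video.Evidence = c(TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE)
)

# Rows: has video evidence; columns: received a full investigation.
tab <- table(video = toy$Complaint.Has.Video.Evidence,
             full  = toy$Is.Full.Investigation)

# margin = 1 makes each row sum to 1, giving the share of fully
# investigated rows separately for rows with and without video evidence.
row_share <- prop.table(tab, margin = 1)
row_share
```

Comparing the two rows of `row_share` shows directly how much more often rows with video evidence are fully investigated, which a single coefficient does not convey.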

SUMMARY

To summarize the exploratory data analysis above, Brooklyn reported the largest number of incidents, followed by the Bronx, compared to the other boroughs of NYC. This could be visualized well using a pie chart and bar plots from the ggplot2 package. It was also possible to visualize which FADO allegation types were reported in each of these boroughs, and whether a particular type (Force, Abuse of Authority, Discourtesy, Offensive Language) was more or less common in complaints received in a specific borough. However, the pattern of allegation types seemed to be similar across the boroughs of NYC.

Furthermore, the location of these incidents could be visualized per region of NYC, which revealed that street/highway, followed by apartment/house, was the most common place for incidents across all boroughs. The heatmap and bar plots showed clearly that Brooklyn, which had the largest number of incidents reported, also had the largest number of Abuse of Authority allegations.

Using the bar plots faceted by FADO allegation type, it was possible to visualize the encounter outcome for each type. Complaints with Force allegations were associated with the largest number of arrests, followed by complaints with Abuse of Authority allegations. Furthermore, complaints with Abuse of Authority allegations were associated with the largest number of encounters that ended with no arrest or summons. These outcomes could also be visualized per region: Brooklyn reported the largest number of arrests, which makes sense as it is the borough with the largest number of FADO allegations, followed by the Bronx. The outcomes could be further broken down by year to show the kind and number of outcomes reported each year.

Lastly, the bar plots showed that after 2010 the number of complaints backed by video evidence kept increasing, which may be linked to the spread of mobile phone cameras during this time. Complainants likely had easier access to technology for recording an incident to support their allegation. This upward trend in video evidence continued year after year and was further supported by the positive correlation coefficient of 0.30 between incident year and whether a complaint had video evidence.

Whether having video evidence affected the completion of an investigation could also be visualized. The positive correlation coefficient of 0.15 between whether a complaint was fully investigated and whether it had video evidence is consistent with the claim that having video evidence helped, in part, to get investigations completed.

ANLY 512 - Problem Set 5

Exploratory Data Analysis

Sandip Darji

09-11-2017