Objectives

The objective of this assingment is to conduct an exploratory data analysis of a data set that you are not familiar with. In this weeks lecture we discussed a number of visualiation approaches to exploring a data set, this assignment will apply those tools and techniques. An important distinction between class examples and applied data science work is that interative and repetitive nature of exploring a data set. It takes time and understand what is is the data and what is interesting in the data.

For this week we will be exploring data from the NYC Data Transparnecy Initiative. They maintain a database of complaints that fall within the Civilian Complain Review Board (CCRB), an independent municiple agency. Your objective is to identify interesting patterns and trends within the data that may be indicative of large scale trends.

This link will allow you to download the data set in .xlsx format. The data file has two tabs: one with metadata, and the “Complaints_Allegations” tab with the actual data.

Deliverable and Grades

For this assignment you should submit a link to a knitr rendered html document that shows your exploratory data analysis. Organize your analysis using section headings:

# This is a top section

## This is a subsection

Your final document should include at minimum 10 visualization. Each should include a brief statement of why you made the graphic.

A final section should summarize what you learned from your EDA. Your grade will be based on the quality of your graphics and the sophistication of your findings.

Import Library / Load Data / Check Data

# Read data
library(ggplot2)
library(ggthemes)
library(tidyverse)
library(ggmap)
library(ggalt)

theme_set(theme_bw())

ccrb <- read.csv("C:/Users/whe001/Documents/ccrb_datatransparencyinitiative.csv", header = TRUE)
ccrb <- na.omit(ccrb)

Visualization

Incident count by years

Maybe due to the nature of data collection, we only start to see sufficient data from 2014. After that, 2016 and 2017 are the summit year of cases. After that case count declines gradually.

# Incident count by Incident Year

g <- ggplot(ccrb, aes(Incident.Year))
g + geom_bar(stat = "count", width = 0.5) +
  labs(title = "Incident Count by Incident Year",
       caption = "Source: CCRB data transparency initiative")

Incident by year by if there is video evidence or not

Most cases dont have Video evidence, and maybe due to the technology advancement, the small amount of video evidence started to appear since 2011.

# Incident count by Incident Year group by Video Survillance or not
g <- ggplot(ccrb, aes(Incident.Year))
g + geom_bar(stat = "count", width = 0.5) +
  labs(title = "Incident Count by Incident Year",
       caption = "Source: CCRB data transparency initiative") +
  facet_grid(Complaint.Has.Video.Evidence ~ .)

Encounter Outcome by year

Most cases results in Arrest / no Arrest or Summons. Smaller cases result in Summons. In recent years, no arrest cases become the most cases among all.

# Encounter Outcome by year?
g + geom_bar(stat = "count", aes(fill =Encounter.Outcome), position = "dodge", width = 0.5) +
  labs(title = "Incident Count by Incident Year",
       caption = "Source: CCRB data transparency initiative")

% of fully investigated cases each year

What seems to be sad here is, the percentage of not fully investigated cases is taking bigger percentage gradually.

# The percentage of them in each year has been fully investigated?
full_invest <- ccrb %>% 
  group_by(Incident.Year, Is.Full.Investigation) %>% 
  summarise(count = n()) %>% 
  mutate(perc = count/sum(count))

brks <- c(0, 0.25, 0.5, 0.75, 1)

Full <- ggplot(full_invest, aes(x = factor(Incident.Year), y = perc, fill = (Is.Full.Investigation)))
Full + geom_bar(stat = "identity", width = 0.5) +
  scale_y_continuous(breaks = brks, labels = scales::percent(brks)) +
  labs( x = "Incident Year", y = "",
          title = "% of fully investigated cases each year", 
          caption = "Source: CCRB data transparency initiative" )

Fully Investigated cases by Area

There’s no obvious difference between fully investigated / not across area, while Brooklyn is with the most cases in both conditions.

# The count of fully investigated cases by area
mm <- ggplot(ccrb, aes(Borough.of.Occurrence,..count..))
mm + geom_point(stat = "count", size = 3) +
  coord_flip() +
  facet_grid(Is.Full.Investigation ~ .)+
  labs( title = "count of fully investigated cases by area", 
        caption = "Source: CCRB data transparency initiative" )

Video Evidence or not by Area

Although most cases don’t have video evidence, the distribution of the the cases across area are almost the same. Which may indicates that the video equipment is evenly distributed across the area.

# The Video Evidence by area
cc <- ggplot(ccrb, aes(Borough.of.Occurrence,..count..))
cc + geom_histogram(color = "white", stat = "count") +
  coord_flip() +
  facet_grid(Complaint.Has.Video.Evidence ~ .)+
  labs( 
        title = "count of cases by area by video evidence", 
        caption = "Source: CCRB data transparency initiative" )

## Warning: Ignoring unknown parameters: binwidth, bins, pad

Count of Incident location

Most cases happend in Street or Highway, while the silver medal goes to Apartment / House.

# Count of Incident Location
kk <- ggplot(ccrb, aes(x = Incident.Location , y = ..count..))
kk + geom_point(col = "tomato2", size = 2, stat = "count") +
  coord_flip() +
  labs( title = "count of Incident.Location", 
                      caption = "Source: CCRB data transparency initiative" )

Incident location count by Encount.Outcome

Seems there’s no correlation between the results or the case and the place where the case happened - as the cases distribution are almost the same.

# Incident.Location count by Encount.Outcome
ee <- ggplot(ccrb, aes(Incident.Location,..count..))
ee + geom_point(stat = "count", size = 1 ) +
  coord_flip() +
  facet_grid(Encounter.Outcome ~ .)+
  labs( title = "count of Incident.Location investigated cases by Encounter.Outcome", 
        caption = "Source: CCRB data transparency initiative" )

Mapping the Case

Just to visualize on a map of where did the case happend, while the size of the dot is the case count.

# Mapping the cases count

palace_ct <- ccrb %>% 
  group_by(Borough.of.Occurrence) %>% 
  summarise(count = n())

New_York <- geocode("New York")

New_York_osm_map <- qmap("New York", zoom = 10, source = "google", maptype="roadmap")   

NY_places <- c("Bronx",
                    "Brooklyn",
                    "Manhattan",
                    "Queens",
                    "Staten Island")

NY_places_loc <- geocode(NY_places)
NY_places_loc$Borough.of.Occurrence <- c("Bronx", "Brooklyn", "Manhattan", "Queens",   "Staten Island")

Place <- merge(filter(palace_ct, Borough.of.Occurrence != "Outside NYC"), NY_places_loc, by = "Borough.of.Occurrence" )


New_York_osm_map + geom_point(aes(x = lon, y = lat, size = count),
                              data = Place,
                              color = "tomato") +  
  geom_encircle(aes(x=lon, y=lat),
                data = Place, size = 2, color = "blue")+
                          labs( 
                                title = "count of cases by area in a map", 
                                caption = "Source: CCRB data transparency initiative" )

Year difference between case receive year and close year

As we are interested in how efficient the case has been dealt with, we want to see the difference between the case receive year and close year. Seems the efficiency is pretty good - about 94% of cases were closed in the same year or just the year after.

year_diff <- ccrb %>% 
  mutate(yr_diff = Close.Year - Received.Year) %>% 
  group_by(yr_diff) %>% 
  count(yr_diff) %>% 
  mutate(sum_cnt = sum(n),
         prct = paste(round(n/sum_cnt*100,2),"%"))

year_diff

## # A tibble: 8 x 4
##   yr_diff     n sum_cnt    prct
##     <int> <int>   <int>   <chr>
## 1       0 93575  200109 46.76 %
## 2       1 95708  200109 47.83 %
## 3       2 10575  200109  5.28 %
## 4       3   160  200109  0.08 %
## 5       4    51  200109  0.03 %
## 6       5     8  200109     0 %
## 7       6    19  200109  0.01 %
## 8       9    13  200109  0.01 %

year_diff_grp <- ccrb %>% 
  mutate(yr_diff = Close.Year - Received.Year) %>% 
  group_by(yr_diff) %>% 
  count(yr_diff) %>% 
  mutate(sum_cnt = sum(n),
         prct =n/sum_cnt) %>% 
  arrange(desc(prct))

ww <- ggplot(year_diff_grp, aes(x=as.character(yr_diff), y = prct, color = as.character(yr_diff)))
ww + geom_bar(stat = "identity", fill = "white") +
  geom_text(aes(label= paste(round(prct*100,2),"%")), vjust=-0.3, size=2) +
  labs( xlab = "Year difference", ylab = "Percentage",
    title = "Year difference count between receive year and close year", 
    caption = "Source: CCRB data transparency initiative" )

ANLY 512 - Problem Set 4 - Wen He - 197626

Exploratory Data Analysis

Wen He

2017-07-07