Objectives

The objective of this assignment is to conduct an exploratory data analysis of a data set that you are not familiar with. In this week's lecture we discussed a number of visualization approaches to exploring a data set; this assignment applies those tools and techniques. An important distinction between class examples and applied data science work is the iterative and repetitive nature of exploring a data set. It takes time to understand what is in the data and what is interesting in it.

For this week we will be exploring data from the NYC Data Transparency Initiative. They maintain a database of complaints that fall within the jurisdiction of the Civilian Complaint Review Board (CCRB), an independent municipal agency. Your objective is to identify interesting patterns and trends within the data that may be indicative of large-scale trends.

This link will allow you to download the data set in .xlsx format. The data file has two tabs: one with metadata, and the “Complaints_Allegations” tab with the actual data.

Packages

library(readxl)
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.3.3
library(ggthemes)
## Warning: package 'ggthemes' was built under R version 3.3.3

Read Excel File

data <- read_excel("ccrb_datatransparencyinitiative.xlsx", sheet = 2)
data <- data.frame(data)
colnames(data)
##  [1] "DateStamp"                                  
##  [2] "UniqueComplaintId"                          
##  [3] "Close.Year"                                 
##  [4] "Received.Year"                              
##  [5] "Borough.of.Occurrence"                      
##  [6] "Is.Full.Investigation"                      
##  [7] "Complaint.Has.Video.Evidence"               
##  [8] "Complaint.Filed.Mode"                       
##  [9] "Complaint.Filed.Place"                      
## [10] "Complaint.Contains.Stop...Frisk.Allegations"
## [11] "Incident.Location"                          
## [12] "Incident.Year"                              
## [13] "Encounter.Outcome"                          
## [14] "Reason.For.Initial.Contact"                 
## [15] "Allegation.FADO.Type"                       
## [16] "Allegation.Description"
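
Before plotting, it can help to check column types, missing values, and the range of years present. The following is a minimal sketch on a small made-up data frame: `toy` and all of its values are hypothetical, not taken from the CCRB file. With the real data, `toy` would be replaced by `data`.

```r
# Hypothetical stand-in for a few columns of `data`; all values are invented.
toy <- data.frame(
  Received.Year = c(2005, 2006, 2007, 2007, NA),
  Close.Year    = c(2006, 2007, 2007, 2008, 2009),
  Borough.of.Occurrence = c("Brooklyn", "Bronx", NA, "Queens", "Brooklyn"),
  stringsAsFactors = FALSE
)

# Column types and a preview of the values
str(toy)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Range of years actually present, useful when choosing histogram breaks
year_range <- range(toy$Received.Year, na.rm = TRUE)
print(year_range)
```

Knowing the observed year range up front makes it easier to pick `breaks` for the histograms without silently dropping rows that fall outside them.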

Graphics

Visualization 1

This histogram shows the distribution of the years in which complaints were received, which helps reveal the trend in the number of complaints per year. The count peaked in 2007 and then declined through 2016.

ggplot(data, aes(Received.Year)) +
  geom_histogram(breaks=seq(1999, 2016, by = 1),  
                 fill="blue", col="black",
                 alpha = .6) + 
  labs(title="Histogram for Complaint Received Year") +
  labs(x="Received Year", y="Count")
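
The column names (`UniqueComplaintId` next to `Allegation.Description`) suggest that each row may be one allegation and that a single complaint can span several rows. That is an assumption, not something confirmed here. If it holds, a histogram of rows counts allegations rather than complaints. A minimal sketch of counting distinct complaints per year, using a hypothetical `toy` data frame with invented values:

```r
# Hypothetical data: complaint 1 has three rows, complaint 3 has two.
toy <- data.frame(
  UniqueComplaintId = c(1, 1, 1, 2, 3, 3),
  Received.Year     = c(2006, 2006, 2006, 2007, 2007, 2007)
)

# Rows per year (allegations, if each row is one allegation)
rows_per_year <- table(toy$Received.Year)

# Keep the first row of each complaint, then count complaints per year
complaints <- toy[!duplicated(toy$UniqueComplaintId), ]
complaints_per_year <- table(complaints$Received.Year)

print(rows_per_year)
print(complaints_per_year)
```

The same `!duplicated(...)` filter could be applied to `data` before plotting if the goal is complaints per year rather than allegations per year.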

Visualization 2

This histogram shows the distribution of the years in which complaints were closed. Compared with the previous chart, it also gives a rough sense of how long allegations take to close. The most cases were closed in 2007, and the number of closures declined afterwards. This could be because fewer and fewer complaints were received after 2007.

ggplot(data, aes(Close.Year)) +
  geom_histogram(breaks=seq(2006, 2016, by = 1),  
                 fill="green", col="black",
                 alpha = .6) + 
  labs(title="Histogram for Complaint Close Year") +
  labs(x="Close Year", y="Count")

Visualization 3

With a basic understanding of when the allegations were received and closed, I can follow up on the question from the previous step: how long does an allegation usually take from receipt to closure? It varies case by case, but most appear to close within 5 years.

ggplot(data, aes(x = Received.Year, y = Close.Year)) + 
  geom_point() + 
  labs(x="Received Year",y="Close Year",title="Relationship between Complaint Received Year and Close Year")
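
A scatter plot of two year columns stacks many rows on the same point, so it cannot show how common each duration is. One way to check the "within 5 years" reading is to compute the difference directly and to size points by count. This is a sketch on a hypothetical `toy` data frame with invented values; `Years.To.Close` is a name introduced here for illustration.

```r
library(ggplot2)

# Hypothetical data standing in for the two year columns of `data`.
toy <- data.frame(
  Received.Year = c(2005, 2005, 2006, 2006, 2006, 2007),
  Close.Year    = c(2006, 2006, 2006, 2007, 2012, 2008)
)

# Years from receipt to closure (year resolution only)
toy$Years.To.Close <- toy$Close.Year - toy$Received.Year

# How many rows fall into each duration
duration_counts <- table(toy$Years.To.Close)
print(duration_counts)

# Share of rows closed within 5 years
share_within_5 <- mean(toy$Years.To.Close <= 5, na.rm = TRUE)
print(share_within_5)

# geom_count() sizes each point by the number of rows at that (x, y) pair
p <- ggplot(toy, aes(x = Received.Year, y = Close.Year)) +
  geom_count() +
  labs(x = "Received Year", y = "Close Year")
print(p)
```

Because only years are available, a duration of 0 or 1 can hide anything from a few days to almost two years.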

Visualization 4

This chart shows the distribution of where the incidents happened, which helps to understand the number of incidents in each borough. It gives a rough idea of the living environment of each borough. Clearly, Brooklyn has the most complaints and Staten Island has the fewest, apart from incidents that happened outside NYC.

# geom_bar() counts rows per category by default, so stat = "count" is not needed
ggplot(data, aes(Borough.of.Occurrence)) +
  geom_bar(fill="grey", col="black",
           alpha = .8) + 
  labs(title="Number of Incidents in Each Borough") +
  labs(x="Borough", y="Incident Count")
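
Ranking the boroughs is easier when the counts are sorted. The sketch below uses a hypothetical `toy` data frame with invented values; it sorts the counts and reorders the factor levels so that a bar chart would draw the bars from most to fewest.

```r
# Hypothetical borough values; the counts are invented.
toy <- data.frame(
  Borough.of.Occurrence = c("Brooklyn", "Bronx", "Brooklyn",
                            "Staten Island", "Bronx", "Brooklyn"),
  stringsAsFactors = FALSE
)

# Counts per borough, largest first
borough_counts <- sort(table(toy$Borough.of.Occurrence), decreasing = TRUE)
print(borough_counts)

# Reorder the factor levels by count; ggplot2 draws bars in level order
toy$Borough.of.Occurrence <- factor(toy$Borough.of.Occurrence,
                                    levels = names(borough_counts))
print(levels(toy$Borough.of.Occurrence))
```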

Visualization 5

Next, I want to understand the reasons for initial contact, which helps explain why the allegations arose. The most common reason is “PD suspected C/V on street”.

# geom_bar() counts rows per category by default, so stat = "count" is not needed
ggplot(data, aes(Reason.For.Initial.Contact)) +
  geom_bar(fill="red", col="black",alpha = .6) + 
  coord_flip() +
  labs(title="Reasons For Initial Contact") +
  labs(x="Reason", y="Incident Count")+
  theme_classic()

Visualization 6

After I have a basic understanding of when, where and why the incidents happened, I can now move on to more details. First, I want to look at the relationship between the video evidence and encounter outcome. This helps me to know whether video evidence is critical in the allegations.

ggplot(data, aes(x = Encounter.Outcome, fill = factor(Complaint.Has.Video.Evidence)))+ 
  geom_bar(stat = 'count', alpha = 0.7) + 
  labs(title='Relationship between Encounter Outcome and Video Evidence',x="Encounter Outcome", y="Number of Complaints") + 
  scale_fill_discrete(name="Video Evidence",labels=c("False","True"))
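
When one category is rare, stacked counts make it hard to compare groups. Row proportions put every encounter outcome on the same 0 to 1 scale. This is a minimal sketch on a hypothetical `toy` data frame; the outcome labels and values are invented for illustration.

```r
# Hypothetical data; outcome labels and values are invented.
toy <- data.frame(
  Encounter.Outcome = c("Arrest", "Arrest", "Arrest", "Arrest",
                        "Summons", "Summons",
                        "No Arrest or Summons", "No Arrest or Summons"),
  Complaint.Has.Video.Evidence = c(TRUE, FALSE, FALSE, FALSE,
                                   TRUE, FALSE,
                                   FALSE, FALSE)
)

# Counts of each outcome by video evidence
counts <- table(toy$Encounter.Outcome, toy$Complaint.Has.Video.Evidence)

# margin = 1 normalises within each row, i.e. within each outcome
video_share <- prop.table(counts, margin = 1)
print(round(video_share, 2))
```

In ggplot2, passing `position = "fill"` to `geom_bar()` draws the same within-group proportions as a chart.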

Visualization 7

After looking at the chart above, I found that few allegations had video evidence, so it is hard to tell how video evidence relates to the encounter outcome. I decided to look at variables that may have a relationship with video evidence, starting with incident location. I want to understand whether certain locations are more likely to have video evidence.

ggplot(data, aes(x = Incident.Location, fill = factor(Complaint.Has.Video.Evidence)))+ 
  geom_bar(stat = 'count', alpha = 0.7) + 
  coord_flip() +
  labs(title='Relationship between Incident Location and Video Evidence',x="Incident Location", y="Number of Complaints") + 
  scale_fill_discrete(name="Video Evidence",labels=c("False","True"))

Visualization 8

The previous graph suggests there is not much more I can do with video evidence. But it does show that the most common place for an incident is street/highway. So I want to look at whether the encounter outcome has anything to do with the incident location.

ggplot(data, aes(x = Incident.Location, fill = Encounter.Outcome))+ 
  geom_bar(stat = 'count', alpha = 0.7) + 
  coord_flip() +
  labs(title='Relationship between Incident Location and Encounter Outcome',x="Incident Location", y="Number of Complaints") + 
  scale_fill_discrete(name="Encounter Outcome")

Visualization 9

It is still not clear whether the incident location has a strong effect on the encounter outcome. So I want to look at the relationship between how the complaint was filed and the encounter outcome, to decide whether the complaint filed mode affects the outcome. It turns out that it is also hard to draw a conclusion from this chart, but it does show that most of the complaints were filed by phone.

ggplot(data, aes(x = Complaint.Filed.Mode, fill = Encounter.Outcome))+ 
  geom_bar(stat = 'count') +
  labs(title='Relationship between How the Complaint was Filed and the Encounter Outcome',x="Complaint Filed Mode", y="Number of Complaints") + 
  scale_fill_discrete(name="Encounter Outcome") +
  theme_classic()+
  ylim(0, 150000)

Visualization 10

I want to look at the distribution of the allegation FADO type (Force, Abuse of Authority, Discourtesy, Offensive Language). The result shows that about half of the complaints are about abuse of authority.

pie(table(data$Allegation.FADO.Type))
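
Slice sizes in a pie chart are hard to read precisely, so a claim such as "about half" is easier to check with the proportions themselves. A minimal sketch on a hypothetical `toy` vector of FADO types; the values are invented and chosen only to illustrate the calculation.

```r
# Hypothetical FADO types; the mix of values is invented.
toy <- data.frame(
  Allegation.FADO.Type = c("Abuse of Authority", "Abuse of Authority",
                           "Force", "Discourtesy",
                           "Abuse of Authority", "Force",
                           "Offensive Language", "Abuse of Authority"),
  stringsAsFactors = FALSE
)

# Share of rows in each FADO category
fado_share <- prop.table(table(toy$Allegation.FADO.Type))

# Percentages, largest first
print(round(100 * sort(fado_share, decreasing = TRUE), 1))
```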

Summary

Exploratory Data Analysis (EDA) lets me see the data in ways I would not without drawing the charts. I gain both a quantitative and a qualitative understanding of the dataset. I come up with questions, decide what I want to discover, and then answer those questions with graphs, which usually show the result more intuitively. From the graphs in this analysis, I can tell that more complaints were received and closed in 2007 than in any other year in the data. I can also tell that complaints are usually closed within 5 years. The data also show that Brooklyn has the most complaints and Staten Island has the fewest, apart from incidents that happened outside NYC. Finally, about half of the complaints are about abuse of authority. My guess is that abuse of authority was a serious problem in Brooklyn, but I will need to do more data manipulation and testing to verify it.
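
A first step toward checking the guess about Brooklyn would be to compare the share of abuse-of-authority allegations across boroughs, rather than the raw counts, since Brooklyn has the most rows overall. This is only a sketch on a hypothetical `toy` data frame with invented values; it shows the calculation, not a result.

```r
# Hypothetical data; boroughs and FADO types are invented combinations.
toy <- data.frame(
  Borough.of.Occurrence = c("Brooklyn", "Brooklyn", "Brooklyn", "Brooklyn",
                            "Bronx", "Bronx", "Bronx", "Bronx"),
  Allegation.FADO.Type  = c("Abuse of Authority", "Abuse of Authority",
                            "Force", "Discourtesy",
                            "Abuse of Authority", "Force",
                            "Force", "Discourtesy"),
  stringsAsFactors = FALSE
)

# TRUE where the row is an abuse-of-authority allegation
is_abuse <- toy$Allegation.FADO.Type == "Abuse of Authority"

# Mean of a logical vector is the share of TRUE values, here per borough
abuse_share <- tapply(is_abuse, toy$Borough.of.Occurrence, mean)
print(round(abuse_share, 2))
```

If the share in Brooklyn is close to the share elsewhere, the large Brooklyn count would mostly reflect the borough's overall volume rather than a borough-specific pattern.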