ANLY 512 - Problem Set 4

Objectives

The objective of this assignment is to conduct an exploratory data analysis of a data set that you are not familiar with. In this week's lecture we discussed a number of visualization approaches to exploring a data set; this assignment applies those tools and techniques. An important distinction between class examples and applied data science work is the iterative and repetitive nature of exploring a data set. It takes time to understand what is in the data and what is interesting in it.

For this week we will be exploring data from the NYC Data Transparency Initiative. They maintain a database of complaints that fall within the jurisdiction of the Civilian Complaint Review Board (CCRB), an independent municipal agency. Your objective is to identify interesting patterns and trends within the data that may be indicative of large-scale trends.

This link will allow you to download the data set in .xlsx format. The data file has two tabs: one with metadata, and the “Complaints_Allegations” tab with the actual data.

Load the data

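library(readxl)
library(tidyverse)

# Download the workbook to a temporary file (binary mode keeps the .xlsx intact)
temp <- tempfile(fileext = ".xlsx")
dataURL <- "http://www1.nyc.gov/assets/ccrb/downloads/excel/ccrb_datatransparencyinitiative.xlsx"
download.file(dataURL, destfile = temp, mode = "wb")

# Sheet 1 holds the metadata, sheet 2 the complaint and allegation records
dict <- read_excel(temp, sheet = 1)
data <- read_excel(temp, sheet = 2)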

1. Number of complaints each year

First, we may want to check the number of complaints received each year, and the composition of each year’s complaints.

data %>% 
  ggplot(aes(x = `Received Year`, fill = `Allegation FADO Type`)) + 
  geom_bar(position = "stack") + 
  labs(title = "Complaints received by year")

We can see that the number of complaints received has trended downward since 2005.
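To put numbers behind the visual trend, the bar heights can be tabulated directly. The sketch below uses a small made-up tibble (`toy`) in place of the real sheet so that it runs on its own; the values are illustrative only. Note that if each row of the sheet is one allegation, as the tab name suggests, these counts are allegations rather than distinct complaints.

```r
library(dplyr)

# Toy stand-in for the CCRB sheet; the values are made up for illustration.
toy <- tibble(
  `Received Year` = c(2005, 2005, 2005, 2006, 2006, 2007),
  `Allegation FADO Type` = c("Force", "Discourtesy", "Force",
                             "Force", "Abuse of Authority", "Force")
)

# Rows per year, i.e. the total bar heights in the stacked chart.
per_year <- toy %>%
  count(`Received Year`, name = "n_rows") %>%
  # Year-over-year change; negative values mean a decline.
  mutate(change = n_rows - lag(n_rows))

per_year
```

Replacing `toy` with `data` should give the same table for the real records, assuming the column names match those used in the plots.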

2. How many cases are under full investigation?

We also want to know how many of the complaints received each year were fully investigated.

data %>% 
  ggplot(aes(x = `Received Year`, fill = `Is Full Investigation`)) +
  geom_bar(stat = "count") +
  labs(title = "Complaints received by year, by full investigation status")

We can see that even though the number of complaints received is decreasing, the percentage of fully investigated complaints does not seem to change much.
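A stacked count chart makes percentages hard to read, so it may help to compute the share directly. This is a minimal sketch on made-up data; it assumes `Is Full Investigation` is stored as a logical column, which may not hold for the real sheet.

```r
library(dplyr)

# Toy data; assumes the flag is logical (TRUE/FALSE). Values are made up.
toy <- tibble(
  `Received Year` = c(2005, 2005, 2005, 2005, 2006, 2006),
  `Is Full Investigation` = c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE)
)

# The mean of a logical vector is the proportion of TRUE values.
share_by_year <- toy %>%
  group_by(`Received Year`) %>%
  summarise(share_full = mean(`Is Full Investigation`), .groups = "drop")

share_by_year
```

For a visual version, `geom_bar(position = "fill")` rescales each year's bar to 100% so the proportions can be compared across years.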

3. How many of them have video evidence?

data %>%
  ggplot(aes(x = `Received Year`, fill = `Complaint Has Video Evidence`)) +
  geom_bar(stat = "count") +
  labs(title = "Number of complaints with video evidence")

We can see that in later years more and more complaints have video evidence.

4. Complaint filing mode

data %>% 
  ggplot(aes(x = `Received Year`, fill = `Complaint Filed Mode`)) +
  geom_bar(stat = "count") +
  labs(title = "Top Mode of Complaints by Year from 2004 to 2016")

Most complaints come in by phone call.

5. How does the number of complaints change across years and boroughs?

With the location information, we can count the complaints in each borough every year.

data %>% 
  filter(`Received Year` > 2004) %>% 
  ggplot(aes(x = `Received Year`, fill = `Borough of Occurrence`)) + 
  geom_bar(position = "dodge") + 
  labs(title = "Number of complaints by borough")

From the result we can see that Brooklyn and the Bronx have the most complaints, but their numbers are decreasing. The other boroughs, however, do not show a significant trend.

6. Allegation types by borough

Since we have the number of complaints in each borough, we also want to look at how often each allegation type occurs in each borough.

data %>% 
  ggplot(aes(x = `Borough of Occurrence`, fill= `Allegation FADO Type`)) +
  geom_bar(stat = "count") + 
  labs(title = "Frequency of Incident Occurrence by Borough and Type", 
       x = "Borough of Occurrence", 
       y = "Frequency of Occurrence")

We can see that Brooklyn has the highest number of incidents. One might think this is because Brooklyn has the largest population among the boroughs, but Queens has the second-largest population and its incident frequency is fairly low.

7. For closed cases, how many of them are fully investigated?

data %>% 
  ggplot(aes(x = `Close Year`, fill = `Is Full Investigation`)) +
  geom_bar(stat = "count") + 
  labs(title = "Number of Cases Closed Each Year by Investigation")

Roughly half of closed cases are fully investigated.

8. Comparing encounter outcomes across boroughs

data %>% 
  ggplot(aes(x = `Encounter Outcome`, 
             fill = `Borough of Occurrence`)) +
  geom_bar(stat = "count") +
  labs(title = "Encounter Outcomes by borough")

We can see that incidents ending in an arrest or a summons likely account for over 50% of the total, and since Brooklyn has the most incidents, it has the most arrests and summonses as well.

9. Encounter outcome’s relation with video evidence

data %>% 
  ggplot(aes(x = `Encounter Outcome`, fill = `Complaint Has Video Evidence`)) +
  geom_bar(stat = "count") + 
  labs(title = "Encounter Outcomes by Video Evidence") + 
  theme(legend.position = "bottom")

Since few incidents come with video evidence, the plot does not make clear whether having video evidence is related to the encounter outcome.
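When one group is much smaller than the other, comparing shares within each group can be more informative than comparing raw counts. The following is a hedged sketch on made-up data; the outcome labels and the logical coding of `Complaint Has Video Evidence` are assumptions, not values taken from the real sheet.

```r
library(dplyr)

# Toy data with assumed labels and a logical evidence flag; values are made up.
toy <- tibble(
  `Encounter Outcome` = c("Arrest", "Arrest", "Summons", "No Arrest or Summons",
                          "Arrest", "No Arrest or Summons"),
  `Complaint Has Video Evidence` = c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE)
)

# Share of each outcome within the video and no-video groups.
outcome_share <- toy %>%
  count(`Complaint Has Video Evidence`, `Encounter Outcome`) %>%
  group_by(`Complaint Has Video Evidence`) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()

outcome_share
```

If the outcome shares look similar in both groups, that would be consistent with no visible relationship; a formal test would be needed to say more.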

10. Relationship between incident year and close year

data %>% 
  ggplot(aes(x = `Incident Year`, y = `Close Year`)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Relation between incident year and closed year")

We can see that a case is usually closed within 5 years of the incident. Because the data are at the year level, many points overlap and the dots appear concentrated. The fitted line also appears shifted to the right, which suggests that a number of cases are closed 4-5 years after their incident year.
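Because year-level points overlap heavily in a scatter plot, tabulating the gap between the two years may show the closing time more directly. This is a minimal sketch on made-up data; `lag_years` is a name introduced here for illustration.

```r
library(dplyr)

# Toy data; the years are made up for illustration.
toy <- tibble(
  `Incident Year` = c(2005, 2005, 2006, 2006, 2006, 2007),
  `Close Year`    = c(2005, 2006, 2006, 2007, 2007, 2008)
)

# Years between incident and closure, then how many rows fall in each gap.
years_to_close <- toy %>%
  mutate(lag_years = `Close Year` - `Incident Year`) %>%
  count(lag_years)

years_to_close
```

In the plot itself, swapping `geom_point()` for `geom_count()` sizes each dot by the number of overlapping rows, which addresses the same overplotting problem.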

Summary

From this chapter, I learned that exploratory data analysis (EDA) is an important step in any data analysis. EDA helps us understand and summarize a dataset without making assumptions about its contents. It also surfaces interesting patterns, which is crucial for the later steps of an analysis, such as modeling.