Objectives

The objective of this assignment is to conduct an exploratory data analysis of a data set that you are not familiar with. In this week’s lecture, we discussed a number of visualization approaches for exploring a data set. This assignment applies those tools and techniques. An important distinction between class examples and applied data science work is the iterative and repetitive nature of exploring a data set. It takes time to understand what the data is and which patterns in it are interesting.

For this week, we will be exploring data from the NYC Data Transparency Initiative. They maintain a database of complaints that fall within the jurisdiction of the Civilian Complaint Review Board (CCRB), an independent municipal agency. Your objective is to identify interesting patterns within the data that may be indicative of large-scale trends.

This link will allow you to download the data set in .xlsx format. The data file has two tabs: one with metadata, and the “Complaints_Allegations” tab with the actual data.

Load the data

library(readxl)
## Warning: package 'readxl' was built under R version 3.6.3
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.6.3
## -- Attaching packages -------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.2.1     v purrr   0.3.3
## v tibble  3.0.1     v dplyr   0.8.4
## v tidyr   1.1.0     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0
## Warning: package 'tibble' was built under R version 3.6.3
## Warning: package 'tidyr' was built under R version 3.6.3
## Warning: package 'readr' was built under R version 3.6.3
## Warning: package 'forcats' was built under R version 3.6.3
## -- Conflicts ----------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
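# Download the workbook to a temporary file; mode = "wb" keeps the binary .xlsx intact
temp <- tempfile(fileext = ".xlsx")
dataURL <- "http://www1.nyc.gov/assets/ccrb/downloads/excel/ccrb_datatransparencyinitiative.xlsx"
download.file(dataURL, destfile = temp, mode = "wb")

# Sheet 1 holds the metadata, sheet 2 ("Complaints_Allegations") holds the data
dict <- read_excel(temp, sheet = 1)
data <- read_excel(temp, sheet = 2)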
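
Before plotting, it can help to confirm that the columns the plots rely on are actually present. The following is a minimal sketch, not part of the original analysis: `toy` is a made-up stand-in for the loaded sheet, and `needed` is an illustrative, partial list of the column names used in the plots.

```r
library(dplyr)

# Hypothetical stand-in for the loaded sheet; the column names come from the
# plots in this document, the values are invented for illustration.
toy <- tibble(
  `Received Year` = c(2014, 2015, 2015),
  `Allegation FADO Type` = c("Force", "Abuse of Authority", "Discourtesy")
)

# Columns the plots rely on (partial list, for illustration only).
needed <- c("Received Year", "Allegation FADO Type", "Borough of Occurrence")

# setdiff() returns the names in `needed` that are absent from the data.
missing_cols <- setdiff(needed, names(toy))
missing_cols
```

With the real data, replacing `toy` with `data` would run the same check; an empty result means every listed column was found.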

Number of complaints each year

data %>% 
  ggplot(aes(x = `Received Year`, fill = `Allegation FADO Type`)) + 
  geom_bar(position = "stack") + 
  labs(title = "Complaints received by year")
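
The stacked bars show the counts only visually. As a rough sketch of how the same numbers could be tabulated, the block below uses `count()` on a small made-up tibble named `toy`; only the column names are taken from the plot above, and the values are invented.

```r
library(dplyr)

# Hypothetical toy rows standing in for the real data.
toy <- tibble(
  `Received Year` = c(2014, 2014, 2014, 2015, 2015),
  `Allegation FADO Type` = c("Force", "Force", "Discourtesy",
                             "Force", "Abuse of Authority")
)

# One row per year/type combination, with the number of rows in column n.
year_type_counts <- toy %>%
  count(`Received Year`, `Allegation FADO Type`)

year_type_counts
```

Each row of the result corresponds to one colored segment of a stacked bar.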

How many cases are under full investigation

data %>% 
  ggplot(aes(x = `Received Year`, fill = `Is Full Investigation`)) +
  geom_bar(stat = "count") +
  labs(title = "Complaints received by year and full investigation status")

How many cases have video evidence

data %>%
  ggplot(aes(x = `Received Year`, fill = `Complaint Has Video Evidence`)) +
  geom_bar(stat = "count") +
  labs(title = "Number of complaints that have video evidence")
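
Because the total number of complaints differs from year to year, raw counts can hide changes in the share of complaints with video evidence. The sketch below computes that share per year on a made-up tibble named `toy`. It assumes `Complaint Has Video Evidence` is a logical column, which has not been verified against the real file.

```r
library(dplyr)

# Hypothetical toy rows; the logical type of the evidence column is an assumption.
toy <- tibble(
  `Received Year` = c(2014, 2014, 2014, 2014, 2015, 2015),
  `Complaint Has Video Evidence` = c(TRUE, FALSE, FALSE, FALSE, TRUE, TRUE)
)

# The mean of a logical vector is the proportion of TRUE values.
video_share <- toy %>%
  group_by(`Received Year`) %>%
  summarise(share_with_video = mean(`Complaint Has Video Evidence`))

video_share
```

A similar view is available directly in the plot by using `geom_bar(position = "fill")`, which rescales every bar to the same height.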

Complaint filed mode

data %>% 
  ggplot(aes(x= `Received Year`, fill = `Complaint Filed Mode`)) +
  geom_bar(stat="count") +
  labs(title = "Complaint Filed Mode by Year from 2004 to 2016")

How the number of complaints changed by year and borough

data %>% 
  filter(`Received Year` > 2004) %>% 
  ggplot(aes(x = `Received Year`, fill = `Borough of Occurrence`)) + 
  geom_bar(position = "dodge") + 
  labs(title = "Number of complaints by borough")

Frequency by borough and type

data %>% 
  ggplot(aes(x = `Borough of Occurrence`, fill= `Allegation FADO Type`)) +
  geom_bar(stat = "count") + 
  labs(title = "Frequency of Incident Occurrence by Borough and Type", 
       x = "Borough of Occurrence", 
       y = "Frequency of Occurrence")

Closed cases by full investigation status

data %>% 
  ggplot(aes(x = `Close Year`, fill = `Is Full Investigation`)) +
  geom_bar(stat = "count") + 
  labs(title = "Number of Cases Closed Each Year by Investigation Status")

Encounter outcomes by video evidence

data %>% 
  ggplot(aes(x = `Encounter Outcome`, fill = `Complaint Has Video Evidence`)) +
  geom_bar(stat = "count") + 
  labs(title = "Number of Complaints by Encounter Outcome and Video Evidence") + 
  theme(legend.position = "bottom")

Relation between incident year and closed year

data %>% 
  ggplot(aes(x = `Incident Year`, y = `Close Year`)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Relation between incident year and closed year")
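
Both axes hold whole years, so many points in the scatter plot fall on top of each other. One possible complement, sketched here on a made-up tibble named `toy` with invented values, is to count how many years pass between the incident and the close of the case. The column name `years_to_close` is introduced only for this illustration.

```r
library(dplyr)

# Hypothetical toy rows standing in for the real data.
toy <- tibble(
  `Incident Year` = c(2010, 2010, 2011, 2012, 2012),
  `Close Year`    = c(2010, 2011, 2011, 2012, 2014)
)

# Difference in calendar years, then the number of rows for each difference.
lag_counts <- toy %>%
  mutate(years_to_close = `Close Year` - `Incident Year`) %>%
  count(years_to_close)

lag_counts
```

A table like this makes visible how many rows sit behind each overlapping point in the scatter plot.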

Summary

From this lecture, I learned how important EDA is in any data analysis. EDA helps a lot in understanding and summarizing a data set without making assumptions about it. The plots also make it easier to see patterns and draw conclusions from the data.