The objective of this assignment is to conduct an exploratory data analysis of a data set that you are not familiar with. In this week’s lecture, we discussed a number of visualization approaches for exploring a data set. This assignment applies those tools and techniques. An important distinction between class examples and applied data science work is the iterative, repetitive nature of exploring a data set: it takes time to understand what the data contain and which patterns in them are interesting.
For this week, we will be exploring data from the NYC Data Transparency Initiative, which maintains a database of complaints that fall under the jurisdiction of the Civilian Complaint Review Board (CCRB), an independent municipal agency. Your objective is to identify interesting patterns and trends within the data that may be indicative of larger-scale trends.
This link will allow you to download the data set in .xlsx format. The data file has two tabs: one with metadata, and the “Complaints_Allegations” tab with the actual data.
library(readxl)
library(tidyverse)
# Download the workbook to a temporary file, then read both sheets:
# sheet 1 is the metadata/data dictionary, sheet 2 ("Complaints_Allegations") holds the complaint records.
temp <- tempfile(fileext = ".xlsx")
dataURL <- "http://www1.nyc.gov/assets/ccrb/downloads/excel/ccrb_datatransparencyinitiative.xlsx"
download.file(dataURL, destfile = temp, mode = "wb")
dict <- read_excel(temp, sheet = 1)
data <- read_excel(temp, sheet = 2)
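Before plotting, it is worth taking a quick structural look at what was read in, since, as noted above, it takes time to understand what the data contain. The sketch below only uses tidyverse functions already loaded: it prints the column names and types and tallies the `Allegation FADO Type` field used in several of the plots that follow (output omitted here).
# Quick structural check: column names/types, plus a tally of one categorical field
glimpse(data)
data %>% count(`Allegation FADO Type`, sort = TRUE)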
data %>%
ggplot(aes(x = `Received Year`, fill = `Allegation FADO Type`)) +
geom_bar(position = "stack") +
labs(title = "Complaints received by year")
data %>%
ggplot(aes(x = `Received Year`, fill = `Is Full Investigation`)) +
geom_bar(stat = "count") +
labs(title = "Complaints received by year, by full investigation status")
data %>%
ggplot(aes(x = `Received Year`, fill = `Complaint Has Video Evidence`)) +
geom_bar(stat = "count") +
labs (title = "Number of complaints that video evidence")
data %>%
ggplot(aes(x= `Received Year`, fill = `Complaint Filed Mode`)) +
geom_bar(stat="count") +
labs(title = "Top Mode of Complaints by Year from 2004 to 2016")
data %>%
filter(`Received Year` > 2004) %>%
ggplot(aes(x = `Received Year`, fill = `Borough of Occurrence`)) +
geom_bar(position = "dodge") +
labs(title = "Number of compliants by borough")
data %>%
ggplot(aes(x = `Borough of Occurrence`, fill= `Allegation FADO Type`)) +
geom_bar(stat = "count") +
labs (title = "Frequency of Incident Occurence by Borough and Type",
x = "Borough of Occurence",
y = "Frequency of Occurrence")
data %>%
ggplot(aes(x = `Close Year`, fill = `Is Full Investigation`)) +
geom_bar(stat = "count") +
labs (title = "Number of Case Closed Each Year by Investigation")
data %>%
ggplot(aes(x = `Encounter Outcome`, fill = `Complaint Has Video Evidence`)) +
geom_bar(stat = "count") +
labs (title = "Number of Incident Occurred Each Year by Evidence") +
theme (legend.position = "bottom")
data %>%
ggplot(aes(x = `Incident Year`, y = `Close Year`)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
labs(title = "Relation between incident year and closed year")
From this lecture, I learned how important EDA is to any data analysis. EDA helps a great deal in understanding and summarizing a data set without making assumptions about it, and clear conclusions can be drawn directly from the plots.