Importing data and starting visualization

library(ggplot2)
library(readxl)
data <- read_excel("C:/Users/Public/ccrb_datatransparencyinitiative.xlsx", 
    sheet = "Complaints_Allegations")

Viz 1 Starting with the trend of complaints and allegations received from 2000 to 2016. The bar chart below shows that the number of complaints received peaked in 2007 and then declined gradually. This suggests that the NYPD has been making an effort to improve the way its officers work.

# Count of records per received year.
# Assumes `Received Year` holds the year each complaint was received.
ggplot(data, aes(x = `Received Year`)) +
  geom_bar() +
  labs(title = "No. of Complaints Received Each Year", x = "Received Year", y = "Number")

Viz 2 The ranking of different types of complaints. The bar chart below shows that most cases involved abuse of authority and use of force. Thus, if the NYPD aims to keep reducing the number of complaints received, it should focus on avoiding unnecessary force and abuse of authority.

# Refer to columns by name inside aes(); `data$` is not needed there.
ggplot(data, aes(x = `Allegation FADO Type`, fill = `Allegation FADO Type`)) +
  geom_bar() +
  labs(title = "No. of Complaints by Allegation Type", x = "Type", y = "Number") +
  theme(legend.position = "bottom") +
  scale_fill_discrete(name = "Type")

Viz 3 Unsurprisingly, Brooklyn ranks first, as it has the largest population of the five boroughs. Queens, however, has the second-largest population but ranks fourth, after the Bronx and Manhattan. This suggests that population alone does not determine how often incidents occur.

# Map fill to the allegation-type column itself (a quoted string would be
# treated as a constant), and use geom_bar() to count a categorical variable.
ggplot(data, aes(x = `Borough of Occurrence`, fill = `Allegation FADO Type`)) +
  geom_bar() +
  labs(title = "Frequency of Incident Occurrence by Borough and Type", x = "Borough of Occurrence", y = "Frequency of Occurrence") +
  scale_fill_discrete(name = "Allegation Type") +
  theme(legend.position = "right")
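
The borough ranking described for Viz 3 can also be checked numerically rather than read off the chart. The following is a minimal sketch, assuming a small made-up data frame `toy` in place of the real sheet; only the column name `Borough of Occurrence` comes from the data above, and the values are invented for illustration.

```r
# Hypothetical stand-in for the real sheet; the values are invented.
toy <- data.frame(
  `Borough of Occurrence` = c("Brooklyn", "Bronx", "Brooklyn", "Queens",
                              "Manhattan", "Brooklyn", "Bronx"),
  check.names = FALSE
)

# Count rows per borough and sort from most to fewest.
borough_counts <- sort(table(toy$`Borough of Occurrence`), decreasing = TRUE)
print(borough_counts)
```

Replacing `toy` with `data` would give the same table for the full data set, assuming the column is read in under that name.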

Viz 4 The bar chart below shows that the answer to both questions is mostly negative: most complaints were not fully investigated, and most had no video evidence. The NYPD should put more effort into investigating people’s complaints.

ggplot(data, aes(x = `Is Full Investigation`, fill = `Complaint Has Video Evidence`)) +
  geom_bar() +
  labs(title = "Investigation by Evidence", x = "Is Full Investigation", y = "Number") +
  scale_fill_discrete(name = "Has Video Evidence")
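
The same comparison can be expressed as a cross-tabulation, which makes the share of each combination explicit. This is a sketch under assumptions: the data frame `toy` and its values are invented, and both columns are assumed to be logical flags (in the real sheet they may be stored differently, for example as text).

```r
# Hypothetical stand-in for the real sheet; the values are invented.
toy <- data.frame(
  `Is Full Investigation`        = c(TRUE, FALSE, FALSE, FALSE, TRUE, FALSE),
  `Complaint Has Video Evidence` = c(TRUE, FALSE, FALSE, TRUE, FALSE, FALSE),
  check.names = FALSE
)

# Cross-tabulate the two flags.
counts <- table(
  full_investigation = toy$`Is Full Investigation`,
  video_evidence     = toy$`Complaint Has Video Evidence`
)

# Share of each cell among all rows; the cells sum to 1.
shares <- prop.table(counts)
print(round(shares, 2))
```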

Viz 5 The bar chart below shows an increasing trend of complaints having video evidence since 2010. I believe the wide use of smartphones with cameras is one of the major drivers of this trend.

# geom_bar() counts rows per year; geom_histogram(stat = "count") only adds a warning.
ggplot(data, aes(x = `Incident Year`, fill = `Complaint Has Video Evidence`)) +
  geom_bar() +
  labs(title = "No. of Incidents Occurred Each Year by Evidence", x = "Incident Year", y = "Number") +
  scale_fill_discrete(name = "Has Video Evidence")

Viz 6 I created the bar chart of cases closed each year, split by whether each case was fully investigated, to see if the low share of full investigations was caused by a high volume of complaints in certain years. The bar chart below shows that the amount of effort put into investigation is not positively related to the number of cases closed in a given year.

ggplot(data, aes(x = `Close Year`, fill = `Is Full Investigation`)) +
  geom_bar() +
  labs(title = "No. of Cases Closed Each Year by Investigation", x = "Close Year", y = "Number") +
  scale_fill_discrete(name = "Is Full Investigation")

Viz 7 As shown in the bar chart below, arrests generally account for a larger share of encounter outcomes than summonses.

ggplot(data, aes(x = `Incident Year`, fill = `Encounter Outcome`)) +
  geom_bar() +
  labs(title = "No. of Incidents Occurred Each Year by Outcome", x = "Incident Year", y = "Number") +
  scale_fill_discrete(name = "Encounter Outcome")

Viz 8 I also wanted to know whether the share of each outcome varies across the boroughs in which incidents occurred. The bar chart below shows that it is consistent with the overall share seen each year.

ggplot(data, aes(x = `Encounter Outcome`, fill = `Borough of Occurrence`)) +
  geom_bar() +
  labs(title = "Encounter Outcome by Borough", x = "Encounter Outcome", y = "Number") +
  scale_fill_discrete(name = "Borough of Occurrence")

Viz 9 The scatter plot below shows that most people filed a complaint within 2 years of the incident; there are, however, two extreme outliers, where people filed 5 and 10 years after the incident occurred.

# Scatter plot with a fitted linear trend line.
ggplot(data, aes(x = `Incident Year`, y = `Received Year`)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(title = "Relationship between Incident Year and Received Year", x = "Incident Year", y = "Received Year")
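
Because many points overlap in a year-by-year scatter plot, the filing delay is easier to inspect as a computed column. A minimal sketch, assuming an invented data frame `toy` and a hypothetical helper column `filing_lag` (neither is part of the original data):

```r
# Hypothetical stand-in for the real sheet; the values are invented.
toy <- data.frame(
  `Incident Year` = c(2005, 2006, 2006, 2010, 2001, 2012),
  `Received Year` = c(2005, 2006, 2007, 2010, 2011, 2012),
  check.names = FALSE
)

# Years between the incident and the complaint being received.
toy$filing_lag <- toy$`Received Year` - toy$`Incident Year`

# How often each lag occurs.
print(table(toy$filing_lag))

# Rows filed more than 2 years after the incident.
late <- toy[toy$filing_lag > 2, ]
print(late)
```

Tabulating the lag shows how many complaints fall within the 2-year window, and the filtered rows list the outliers directly.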

Viz 10 As shown in the scatter plot below, most cases were closed within 2 years of being filed. The plot also suggests that the NYPD’s efficiency in investigating complaints has been improving since 2010.

ggplot(data, aes(x = `Received Year`, y = `Close Year`)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(title = "Relationship between Received Year and Close Year", x = "Received Year", y = "Close Year")
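
One way to examine the claim about improving efficiency is to compute the average time to close for each received year. This is a sketch under assumptions: the data frame `toy` and the helper column `closing_lag` are invented for illustration, and the numbers say nothing about the real data.

```r
# Hypothetical stand-in for the real sheet; the values are invented.
toy <- data.frame(
  `Received Year` = c(2008, 2008, 2010, 2010, 2012, 2012),
  `Close Year`    = c(2010, 2009, 2011, 2011, 2012, 2013),
  check.names = FALSE
)

# Years between a complaint being received and the case being closed.
toy$closing_lag <- toy$`Close Year` - toy$`Received Year`

# Mean closing lag for each received year.
mean_lag <- tapply(toy$closing_lag, toy$`Received Year`, mean)
print(mean_lag)
```

A falling mean lag over time would support the reading of the scatter plot. Note that cases still open in the most recent years are absent from the closed-case data, which can make recent lags look shorter than they are.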

Summary

Describing interesting patterns and transforming the data are among the most important steps of any data analysis. Exploratory data analysis (EDA) is an approach to data analysis that applies a variety of techniques to the raw data set collected. It helps us find the relationships and structure in the data before we start modelling it or testing any hypotheses. To establish the relationships between the different variables, we need to spend time compiling, plotting and reviewing the actual data. EDA is often performed on a representative sample of the data. Here I analyzed the raw data collected from the linked source, with a focus on the close date.

The two basic types of EDA techniques in general use are graphical techniques and quantitative techniques. Graphical techniques show the properties of a data set in a readable graphical format. It is easier to understand the properties of a data set, and the relationships between its variables, by looking at different graphs than by looking at the raw data, and the results of the analysis stand out more clearly.

The reason for relying on graphics is that the main purpose of EDA is to explore the data set, and graphics give analysts the visual power to do so: coaxing the data into revealing its structural secrets, and always being ready to gain some new, unexpected insight into the data.