The objective of this assignment is to conduct an exploratory data analysis of a data set that you are not familiar with. In this week's lecture we discussed a number of visualization approaches to exploring a data set; this assignment will apply those tools and techniques. An important distinction between class examples and applied data science work is the iterative and repetitive nature of exploring a data set. It takes time to understand what is in the data and what is interesting in the data.
For this week we will be exploring data from the NYC Data Transparency Initiative. They maintain a database of complaints that fall within the jurisdiction of the Civilian Complaint Review Board (CCRB), an independent municipal agency. Your objective is to identify interesting patterns and trends within the data that may be indicative of large-scale trends.
This link will allow you to download the data set in .xlsx format. The data file has two tabs: one with metadata, and the “Complaints_Allegations” tab with the actual data.
library(readxl)
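# Sheet 2 is the "Complaints_Allegations" tab; sheet 1 holds the metadata
ccpd <- read_excel("C:/Users/Vinay/ccrb_datatransparencyinitiative.xlsx", sheet = 2)
# Cache the imported data so later sessions can load() it without re-reading the workbook
save(ccpd, file = "ccpd.RData")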
str(ccpd)
## Classes 'tbl_df', 'tbl' and 'data.frame': 204397 obs. of 16 variables:
## $ DateStamp : POSIXct, format: "2016-11-29" "2016-11-29" ...
## $ UniqueComplaintId : num 11 18 18 18 18 18 18 18 18 18 ...
## $ CloseYear : num 2006 2006 2006 2006 2006 ...
## $ ReceivedYear : num 2005 2004 2004 2004 2004 ...
## $ BoroughofOccurrence : chr "Manhattan" "Brooklyn" "Brooklyn" "Brooklyn" ...
## $ IsFullInvestigation : num 0 1 1 1 1 1 1 1 1 1 ...
## $ ComplaintHasVideoEvidence : num 0 0 0 0 0 0 0 0 0 0 ...
## $ ComplaintFiledMode : chr "On-line website" "Phone" "Phone" "Phone" ...
## $ ComplaintFiledPlace : chr "CCRB" "CCRB" "CCRB" "CCRB" ...
## $ ComplaintContainsStop&FriskAllegations: num 0 0 0 0 0 0 0 0 0 0 ...
## $ IncidentLocation : chr "Street/highway" "Street/highway" "Street/highway" "Street/highway" ...
## $ IncidentYear : num 2005 2004 2004 2004 2004 ...
## $ EncounterOutcome : chr "No Arrest or Summons" "Arrest" "Arrest" "Arrest" ...
## $ ReasonForInitialContact : chr "Other" "PD suspected C/V of violation/crime - street" "PD suspected C/V of violation/crime - street" "PD suspected C/V of violation/crime - street" ...
## $ AllegationFADOType : chr "Abuse of Authority" "Abuse of Authority" "Discourtesy" "Discourtesy" ...
## $ AllegationDescription : chr "Threat of arrest" "Refusal to obtain medical treatment" "Word" "Word" ...
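Selecting the tab by name rather than by position is a little more robust if the workbook layout changes. The sketch below is an illustration only: it uses the `datasets.xlsx` example workbook bundled with readxl (not the CCRB file), and the names `path`, `sheets`, and `cars` are placeholders.

```r
library(readxl)

# Example workbook shipped with readxl; it stands in for the CCRB download
path <- readxl_example("datasets.xlsx")

# List the tab names before choosing one
sheets <- excel_sheets(path)
print(sheets)

# Read a tab by name instead of by index
cars <- read_excel(path, sheet = "mtcars")
str(cars)
```

For the CCRB workbook the equivalent call would pass `sheet = "Complaints_Allegations"`, assuming the tab is named exactly as described above.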
Between 2006 and 2015, the number of complaints received by the CCRB has steadily declined, from 7,663 in 2006 to 4,461 in 2015.
library(dplyr)    # provides %>%, filter(), group_by(), summarise(), n_distinct()
library(ggplot2)
ccpd %>%
filter(ReceivedYear > 2006) %>%   # note: this drops 2006 itself; use >= 2006 to include it
group_by(ReceivedYear) %>%
summarise(UniqueCaseY = n_distinct(UniqueComplaintId)) %>%   # one count per complaint, not per allegation
ggplot(aes(x = ReceivedYear, y = UniqueCaseY)) +
geom_line(group = 1, color = 'red', size = 2) +
geom_point(color = 'darkred') +
labs(title = "Complaints Received from 2006 to 2015", x = "Year", y = "Number of Complaints") +
theme(legend.position = "none",
panel.grid.major = element_blank(),
panel.background = element_rect(colour = "lightblue"),
axis.line = element_line(colour = "lightblue"))
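The `str()` output shows the same `UniqueComplaintId` repeated across rows, which suggests one row per allegation rather than per complaint. That is why the yearly totals use `n_distinct()` instead of `n()`. A minimal sketch of the difference, on a small made-up data frame (`toy` and its values are hypothetical; only the column names come from the data set):

```r
library(dplyr)

# Hypothetical toy rows: complaint 18 carries three allegations, complaint 31 two
toy <- data.frame(
  UniqueComplaintId = c(11, 18, 18, 18, 25, 31, 31),
  ReceivedYear      = c(2005, 2004, 2004, 2004, 2005, 2005, 2005)
)

per_year <- toy %>%
  group_by(ReceivedYear) %>%
  summarise(
    Rows       = n(),                           # counts allegations (rows)
    Complaints = n_distinct(UniqueComplaintId)  # counts complaints
  )

print(per_year)
```

Printing the same kind of table for `ccpd` is one way to check the yearly figures quoted in the text against the data.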
Between 2006 and 2015, the number of complaints closed by the CCRB has steadily declined.
library(ggplot2)
library(ggthemes)
ccpd %>%
group_by(CloseYear) %>%
summarise(UniqueCaseY = n_distinct(UniqueComplaintId)) %>%
ggplot(aes(x=CloseYear, y=UniqueCaseY)) +
geom_line(group=1,color='darkred', size=2) +
labs(title = "Closed Complaints from 2006 to 2015", x = "Year", y = "No of Closed Complaints") +
theme_solarized()
Between 2006 and 2016, most complaints received by the CCRB were filed by phone.
library(ggplot2)
library(ggthemes)
ccpd %>%
filter(ReceivedYear > 2006)%>%
ggplot(aes(x= ReceivedYear, fill = ComplaintFiledMode)) +
geom_bar(stat="count") +
coord_flip() +
labs(title = "Top Mode of Complaints by Year from 2006 to 2016", x = "", y = "Complaint Count")
The following bar chart shows that Brooklyn had the most complaints, followed by Manhattan, the Bronx, and Queens.
library(ggthemes)
ccpd %>%
group_by(BoroughofOccurrence) %>%
summarise(Cases = n_distinct(UniqueComplaintId)) %>%
ggplot(aes(x=reorder(BoroughofOccurrence,-Cases), y=Cases)) +
geom_bar(fill = 'darkred', stat = 'identity') +
labs(title = "Total Number of Complaints by Place of Occurrence", x = "Place of Occurrence", y = "Number of Complaints") +
theme(legend.position="none", panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour = "black"))
The following graphs show the overall trend in complaints for each borough. Brooklyn had the highest number of complaints in 2006, but the number has gradually fallen over the years.
library(ggthemes)
ccpd %>%
group_by(ReceivedYear, BoroughofOccurrence) %>%
summarise(UniqueCaseY = n_distinct(UniqueComplaintId)) %>%
ggplot(aes(x=ReceivedYear, y=UniqueCaseY, group=ReceivedYear, color=BoroughofOccurrence)) +
facet_grid(BoroughofOccurrence ~., scales='free') +
geom_point(color='lightblue') +
geom_line(group=1) +
labs(title="Trends over years by Borough of Occurrence", x="Received Years", y="Number of Complaints") +
theme(panel.grid.major = element_blank(),
panel.background = element_rect(color='lightblue'),
axis.line = element_line(colour = "black"))
The following line graph shows the trend in the number of complaints by allegation type over the years. From 2006 through 2015, the CCRB received over 40,000 complaints with an abuse of authority allegation. As the graph shows, however, the trend is declining over the years.
library(ggthemes)
ccpd %>%
group_by(ReceivedYear, AllegationFADOType) %>%
filter(ReceivedYear > 2006) %>%
summarise(UniqueCaseY = n_distinct(UniqueComplaintId)) %>%
ggplot(aes(x = ReceivedYear, y = UniqueCaseY, color = AllegationFADOType)) +
geom_line(size = 1.5) +
geom_point(size = 1.1) +
labs(title = "Trends over years by FADO Type", x = "Years", y = "Number of Complaints") +
theme_hc(bgcolor = "darkunica") +
# the plot maps colour, not fill, so the colour scale is the one that takes effect
scale_colour_hc("darkunica")
The following graph shows the overall trend in complaints by the place where the complaint was filed. The CCRB itself has received the highest number of complaints over the years.
library(ggplot2)
library(ggthemes)
ccpd %>%
filter(ReceivedYear > 2006) %>%
# treat the year as a category so every bar gets its own whole-year label
ggplot(aes(x = factor(ReceivedYear), fill = ComplaintFiledPlace)) +
geom_bar() +
guides(fill = guide_legend(title = "Complaint Place")) +
labs(title = "Complaint Place by Year Received from 2006 to 2015", x = "Years", y = "Number of Complaints") +
theme_pander() +
scale_fill_pander()
In reported incidents of police misconduct, police interaction concludes with three main types of action: (1) an arrest is made, (2) a summons is issued or (3) neither an arrest nor summons occurs. No arrest or summons has been the most common outcome over time.
library(ggthemes)
ccpd %>%
group_by(ReceivedYear,EncounterOutcome) %>%
filter(ReceivedYear > 2006)%>%
summarise(Cases = n_distinct(UniqueComplaintId)) %>%
ggplot(aes(x=ReceivedYear,y=Cases,fill=factor(EncounterOutcome))) +
geom_bar(position = "dodge", stat = "identity") +
labs(title = "Encounter Outcome By Year", x = "Received Year", y = "Number of Complaints") +
guides(fill = guide_legend(title = "Encounter Outcome")) +
theme_minimal()
The number of complaints that contain video has increased since 2012. In 2015, just over 10% of cases closed contained video.
library(ggthemes)
ccpd %>%
group_by(CloseYear,ComplaintHasVideoEvidence) %>%
filter(CloseYear > 2006)%>%
summarise(Cases = n_distinct(UniqueComplaintId)) %>%
ggplot(aes(x=CloseYear,y=Cases,fill=factor(ComplaintHasVideoEvidence))) +
geom_bar(position = "dodge", stat = "identity") +
labs(title = "Complaints containing Video Evidence", x = "Close Year", y = "Number of Complaints") +
guides(fill = guide_legend(title = "Video Evidence")) +
theme_hc() +
# relabel the 0/1 levels on the fill scale, which is the aesthetic in use
scale_fill_hc(labels = c("No", "Yes"))
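The chart above plots counts, while the text quotes a share of closed cases. A minimal sketch of computing that share directly, on a small made-up data frame (`toy`, `video_share`, and all values are hypothetical; only the column names come from the data set):

```r
library(dplyr)

# Hypothetical toy rows: one row per allegation, several rows per complaint
toy <- data.frame(
  UniqueComplaintId         = c(1, 1, 2, 3, 4, 4, 5, 6),
  CloseYear                 = c(2014, 2014, 2014, 2015, 2015, 2015, 2015, 2015),
  ComplaintHasVideoEvidence = c(0, 0, 1, 0, 1, 1, 0, 0)
)

video_share <- toy %>%
  # collapse to one row per complaint so allegations are not double counted
  distinct(UniqueComplaintId, CloseYear, ComplaintHasVideoEvidence) %>%
  group_by(CloseYear) %>%
  summarise(
    Complaints = n(),
    PctVideo   = 100 * mean(ComplaintHasVideoEvidence)  # mean of a 0/1 flag is the share
  )

print(video_share)
```

Applied to `ccpd`, the same steps would give the yearly percentage to compare against the figure quoted for 2015.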
The percentage of CCRB complaints that are fully investigated has gradually increased from 19% of complaints in 2006 to over 30% in 2016.
library(ggthemes)
ccpd %>%
group_by(CloseYear,IsFullInvestigation) %>%
filter(CloseYear > 2006)%>%
summarise(Cases = n_distinct(UniqueComplaintId)) %>%
ggplot(aes(x=CloseYear,y=Cases,fill=factor(IsFullInvestigation))) +
geom_bar(position = "dodge", stat = "identity") +
labs(title = "CCRB complaints fully investigated over years", x = "Close Year", y = "Number of Complaints") +
guides(fill = guide_legend(title = "Fully Investigated")) +
theme_hc() +
# relabel the 0/1 levels on the fill scale, which is the aesthetic in use
scale_fill_hc(labels = c("No", "Yes"))
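Because the claim about full investigations is stated as a percentage, a proportional bar chart may show it more directly than side-by-side counts. The following is only a sketch under assumptions: `toy`, `Investigated`, and all values are made up, and `position = "fill"` is swapped in for the dodged bars used above.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy rows; only the column names come from the CCRB data
toy <- data.frame(
  UniqueComplaintId   = c(1, 1, 2, 3, 4, 5, 5, 6),
  CloseYear           = c(2014, 2014, 2014, 2014, 2015, 2015, 2015, 2015),
  IsFullInvestigation = c(0, 0, 1, 0, 1, 1, 1, 0)
)

p <- toy %>%
  # collapse to one row per complaint so allegations are not double counted
  distinct(UniqueComplaintId, CloseYear, IsFullInvestigation) %>%
  mutate(Investigated = factor(IsFullInvestigation,
                               levels = c(0, 1), labels = c("No", "Yes"))) %>%
  ggplot(aes(x = factor(CloseYear), fill = Investigated)) +
  geom_bar(position = "fill") +                  # rescale each year's bar to proportions
  scale_y_continuous(labels = scales::percent) + # show the y axis as percentages
  labs(title = "Share of complaints fully investigated (toy data)",
       x = "Close Year", y = "Share of Complaints", fill = "Fully Investigated")

print(p)
```

With the real data, the height of the "Yes" segment in each year would correspond to the percentage discussed in the text.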