The objective of this assingment is to conduct an exploratory data analysis of a data set that you are not familiar with. In this weeks lecture we discussed a number of visualiation approaches to exploring a data set, this assignment will apply those tools and techniques. An important distinction between class examples and applied data science work is that interative and repetitive nature of exploring a data set. It takes time and understand what is is the data and what is interesting in the data.
For this week we will be exploring data from the NYC Data Transparnecy Initiative. They maintain a database of complaints that fall within the Civilian Complain Review Board (CCRB), an independent municiple agency. Your objective is to identify interesting patterns and trends within the data that may be indicative of large scale trends.
This link will allow you to download the data set in .xlsx format. The data file has two tabs: one with metadata, and the “Complaints_Allegations” tab with the actual data.
For this assignment you should submit and a rpubs link to a knitr rendered html document that shows your exploratory data analysis. Organize your analysis using meaningful section headings:
library(readxl)
library(readr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(sqldf)
## Loading required package: gsubfn
## Loading required package: proto
## Loading required package: RSQLite
library(ggthemes)
library(forcats)
library(tidyverse)
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: purrr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag(): dplyr, stats
ccrb <- read_excel("C:/Users/khush/Downloads/ccrb_datatransparencyinitiative.xlsx", sheet = "Complaints_Allegations")
ccrb
## # A tibble: 204,397 x 16
## DateStamp UniqueComplaintId Close.Year Received.Year
## <dttm> <dbl> <dbl> <dbl>
## 1 2016-11-29 11 2006 2005
## 2 2016-11-29 18 2006 2004
## 3 2016-11-29 18 2006 2004
## 4 2016-11-29 18 2006 2004
## 5 2016-11-29 18 2006 2004
## 6 2016-11-29 18 2006 2004
## 7 2016-11-29 18 2006 2004
## 8 2016-11-29 18 2006 2004
## 9 2016-11-29 18 2006 2004
## 10 2016-11-29 18 2006 2004
## # ... with 204,387 more rows, and 12 more variables:
## # Borough.of.Occurrence <chr>, Is.Full.Investigation <lgl>,
## # Complaint.Has.Video.Evidence <lgl>, Complaint.Filed.Mode <chr>,
## # Complaint.Filed.Place <chr>,
## # `Complaint.ContainsStop&FriskAllegations` <lgl>,
## # Incident.Location <chr>, Incident.Year <dbl>, Encounter.Outcome <chr>,
## # Reason.For.Initial.Contact <chr>, Allegation.FADO.Type <chr>,
## # Allegation.Description <chr>
summary(ccrb)
## DateStamp UniqueComplaintId Close.Year Received.Year
## Min. :2016-11-29 Min. : 1 Min. :2006 Min. :1999
## 1st Qu.:2016-11-29 1st Qu.:17356 1st Qu.:2008 1st Qu.:2007
## Median :2016-11-29 Median :34794 Median :2010 Median :2009
## Mean :2016-11-29 Mean :34778 Mean :2010 Mean :2010
## 3rd Qu.:2016-11-29 3rd Qu.:52204 3rd Qu.:2013 3rd Qu.:2012
## Max. :2016-11-29 Max. :69492 Max. :2016 Max. :2016
## Borough.of.Occurrence Is.Full.Investigation Complaint.Has.Video.Evidence
## Length:204397 Mode :logical Mode :logical
## Class :character FALSE:107084 FALSE:195530
## Mode :character TRUE :97313 TRUE :8867
##
##
##
## Complaint.Filed.Mode Complaint.Filed.Place
## Length:204397 Length:204397
## Class :character Class :character
## Mode :character Mode :character
##
##
##
## Complaint.ContainsStop&FriskAllegations Incident.Location Incident.Year
## Mode :logical Length:204397 Min. :1999
## FALSE:119856 Class :character 1st Qu.:2007
## TRUE :84541 Mode :character Median :2009
## Mean :2010
## 3rd Qu.:2012
## Max. :2016
## Encounter.Outcome Reason.For.Initial.Contact Allegation.FADO.Type
## Length:204397 Length:204397 Length:204397
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## Allegation.Description
## Length:204397
## Class :character
## Mode :character
##
##
##
ccrb %>% count(Received.Year)
## # A tibble: 17 x 2
## Received.Year n
## <dbl> <int>
## 1 1999 13
## 2 2000 6
## 3 2002 15
## 4 2003 25
## 5 2004 864
## 6 2005 11935
## 7 2006 23864
## 8 2007 24384
## 9 2008 22191
## 10 2009 21365
## 11 2010 17817
## 12 2011 16454
## 13 2012 15683
## 14 2013 14799
## 15 2014 13658
## 16 2015 12680
## 17 2016 8644
ccrb %>% count(Borough.of.Occurrence)
## # A tibble: 7 x 2
## Borough.of.Occurrence n
## <chr> <int>
## 1 Bronx 49442
## 2 Brooklyn 72215
## 3 Manhattan 42104
## 4 NA 483
## 5 Outside NYC 170
## 6 Queens 30883
## 7 Staten Island 9100
Visualising number of complaints received each year.
complaint1<-sqldf('select distinct UniqueComplaintId,"Received.Year" as Year from ccrb')
ggplot(complaint1) +
geom_line(aes(Year, group = 1),
stat = 'count',
color = 'darkblue',
size = 1.5) +
theme_economist() +
labs(title = 'Fig.1: Complaints Received in Each Year',
x = 'Year',
y = 'Complaints Received')
As we can observe from the above graph that number of complaints gradually reduced.
we need to observe the number of incident received per year on the basis of borough.
ggplot(ccrb, aes(x=ccrb$Received.Year, fill= ccrb$Borough.of.Occurrence)) + geom_bar(stat = "count") + labs (title = "Borough of Occurrence by Received Year", x="Year", y="Number") + scale_fill_discrete(name = "Borough of Occurrence")
From the above graph we can say that most of the incident was received against brooklyn borough.
Now we need to analyse that where do people mostly complaint regarding incident.
ggplot(ccrb, aes(x=ccrb$Received.Year, fill= ccrb$Complaint.Filed.Place)) + geom_bar(stat = "count") + labs (title = "Received Year by Complaint Filed Place", x="Received Year", y="Number") + scale_fill_discrete(name = "Complaint Filed Place")
## Warning: position_stack requires non-overlapping x intervals
We can observe that most of the complaints were filled in CCRB.
ggplot(ccrb, aes(x=ccrb$Borough.of.Occurrence, fill= ccrb$Complaint.Filed.Place)) + geom_bar(stat = "count") + labs (title = "Borough of Occurrence by Complaint Filed Place", x="Borough of Occurrence", y="Number") + scale_fill_discrete(name = "Complaint Filed Place")
ggplot(ccrb,aes(x=ccrb$Received.Year, fill=ccrb$Allegation.FADO.Type)) + geom_bar(stat = "count") + labs (title = "Allegation FADO Type Each Year", x="Year", y="Complaints")+ scale_fill_discrete(name = "Allegation Fado Type")+scale_x_continuous(breaks = seq(1999,2016,2))
We can see that most of the allegation fado type was “Abuse of authority” but with time the ratio came down
Nextly I would like to see the location where the incident occured and the fado type
ggplot(ccrb, aes(x=ccrb$Incident.Location, fill = ccrb$Allegation.FADO.Type ))+geom_bar(stat="count")+ labs (title = "Number of Incident Occurred by FADO Type", x="Incident Location", y="Number Of Incidents") + theme (legend.position = "bottom") + scale_fill_discrete(name = "FADO Type")+scale_y_continuous(breaks = seq(0,125000,4000))+theme(axis.text.x = element_text(angle = 90, hjust = 1))+coord_flip()
Above graph shows that most of the incident occurred in street/highway and apartment where the FADO type is “Abuse of authority”
Now,let’s analyse that whether the complaint was filled at the same time when the incident occurred.
fit1 <- lm(ccrb$Incident.Year~ccrb$Received.Year)
plot(ccrb$Incident.Year,ccrb$Received.Year, main = "Incident Year & Received Year", xlab = "Incident Year", ylab = "Received Year", pch = 20)
abline(fit1,col="blue")
We can see that there are two outlier which means that the complaint for this incident was filled after a long time.
This visualization will help us to know how long it took to close the complaint
ggplot(ccrb,aes(ccrb$Received.Year,ccrb$Close.Year))+geom_point() + geom_smooth(method = lm) + labs (title = "Received Year Vs Close Year", y="Close Year", x="Received Year")+scale_x_continuous(breaks = seq(1999,2016,2))+scale_y_continuous(breaks = seq(1999,2016,2))+theme(panel.grid.minor = element_line(colour="white", size=0.5))
It is also important to know the the mode of filling the complaint.
ggplot(ccrb) +
geom_bar(aes(Received.Year, fill = Complaint.Filed.Mode),
stat = 'count') +
scale_fill_discrete() +
labs(title = 'Fig.3: Complaints Received by Filed Mode in Each Year',
x = 'Year',
y = 'Complaints Received')
We can observe that most of the complaints were filled via phone.
Now would like to analyse that whether each complaint has video as an evidence.
ggplot(ccrb, aes(x=ccrb$UniqueComplaintId, fill= ccrb$Complaint.Has.Video.Evidence)) + geom_bar(stat = "count") + labs (title = "UniqueComplaintId by Complaint Has Video Evidence", x="UniqueComplaintId", y="Number") + scale_fill_discrete(name = "Has Video Evidence")
Above graph states that not most of the graph has video as an evidence.
Now we need to analyse that whther complaint was fully investigated and does it have any video evidence.
ggplot(ccrb, aes(x=ccrb$Is.Full.Investigation, fill= ccrb$Complaint.Has.Video.Evidence)) + geom_bar(stat = "count") + labs (title = "Investigation by Evidence", x="Is Full Investigation", y="Number") + scale_fill_discrete(name = "Has Video Evidence")
From the above graph we can observe that Most of the complaints which were fully investigated doesn’t have video evidence but we can also observe that there were some cases which had video evidence which might have helped to complete the investigation. The complaints which were not fully investigated had very few number of video evidence
Now I would like to analyse number of complaints were fully investigated each year.
ggplot(ccrb, aes(x=ccrb$Close.Year, fill= ccrb$Is.Full.Investigation)) + geom_bar(stat = "count") + labs (title = "Number of Case Closed Each Year by Investigation", x="Close Year", y="Number") + scale_fill_discrete(name = "Is Full Investigation")
At last I would like to observe what was the outcome after the case was closed.
ggplot(ccrb, aes(x=ccrb$Close.Year, fill= ccrb$Encounter.Outcome)) + geom_bar(stat = "count") + labs (title = "Close Year by Outcome", x="Close Year", y="Number") + theme (legend.position = "bottom") + scale_fill_discrete(name = "Encountered Outcome")
-Number of complaints gradually reduced after 2009. -People prefer to file complaint in CCRB but some incidents were reported after few months/years from the date of incident -Most of the incident was received against brooklyn borough. -Most of the allegation fado type was “Abuse of authority”. -Most of the incident occurred in street/highway and apartment where the FADO type is “Abuse of authority”.