The objective of this kernel is to conduct an exploratory data analysis of “Complaints_Allegations” data set.
The dats set is pulled from the following : link. The data file has two tabs: one with metadata, and the “Complaints_Allegations” tab with the actual data. We will be exploring the data from NYC Data Transparency Initiative wherein they maintain a database of complaints that fall within the Civilian Complain Review Board (CCRB), an independent municiple agency. Our objective is to identify interesting patterns and trends within the data that may be indicative of large scale trends.
Loading the dataset and checking for descriptive statistics:
library(ggplot2)
library(ggthemes)
library(readxl)
ccrb<- read_excel("C:/Users/Kowshik Kumar B/Documents/HU/512/Problem Set 4/ccrb_datatransparencyinitiative.xlsx", sheet = "Complaints_Allegations")
head(ccrb)
## # A tibble: 6 x 16
## DateStamp UniqueComplaint~ `Close Year` `Received Year`
## <dttm> <dbl> <dbl> <dbl>
## 1 2016-11-29 00:00:00 11 2006 2005
## 2 2016-11-29 00:00:00 18 2006 2004
## 3 2016-11-29 00:00:00 18 2006 2004
## 4 2016-11-29 00:00:00 18 2006 2004
## 5 2016-11-29 00:00:00 18 2006 2004
## 6 2016-11-29 00:00:00 18 2006 2004
## # ... with 12 more variables: `Borough of Occurrence` <chr>, `Is Full
## # Investigation` <lgl>, `Complaint Has Video Evidence` <lgl>, `Complaint
## # Filed Mode` <chr>, `Complaint Filed Place` <chr>, `Complaint Contains Stop
## # & Frisk Allegations` <lgl>, `Incident Location` <chr>, `Incident
## # Year` <dbl>, `Encounter Outcome` <chr>, `Reason For Initial Contact` <chr>,
## # `Allegation FADO Type` <chr>, `Allegation Description` <chr>
tail(ccrb)
## # A tibble: 6 x 16
## DateStamp UniqueComplaint~ `Close Year` `Received Year`
## <dttm> <dbl> <dbl> <dbl>
## 1 2016-11-29 00:00:00 69476 2016 2016
## 2 2016-11-29 00:00:00 69476 2016 2016
## 3 2016-11-29 00:00:00 69476 2016 2016
## 4 2016-11-29 00:00:00 69476 2016 2016
## 5 2016-11-29 00:00:00 69476 2016 2016
## 6 2016-11-29 00:00:00 69476 2016 2016
## # ... with 12 more variables: `Borough of Occurrence` <chr>, `Is Full
## # Investigation` <lgl>, `Complaint Has Video Evidence` <lgl>, `Complaint
## # Filed Mode` <chr>, `Complaint Filed Place` <chr>, `Complaint Contains Stop
## # & Frisk Allegations` <lgl>, `Incident Location` <chr>, `Incident
## # Year` <dbl>, `Encounter Outcome` <chr>, `Reason For Initial Contact` <chr>,
## # `Allegation FADO Type` <chr>, `Allegation Description` <chr>
str(ccrb)
## tibble [204,397 x 16] (S3: tbl_df/tbl/data.frame)
## $ DateStamp : POSIXct[1:204397], format: "2016-11-29" "2016-11-29" ...
## $ UniqueComplaintId : num [1:204397] 11 18 18 18 18 18 18 18 18 18 ...
## $ Close Year : num [1:204397] 2006 2006 2006 2006 2006 ...
## $ Received Year : num [1:204397] 2005 2004 2004 2004 2004 ...
## $ Borough of Occurrence : chr [1:204397] "Manhattan" "Brooklyn" "Brooklyn" "Brooklyn" ...
## $ Is Full Investigation : logi [1:204397] FALSE TRUE TRUE TRUE TRUE TRUE ...
## $ Complaint Has Video Evidence : logi [1:204397] FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ Complaint Filed Mode : chr [1:204397] "On-line website" "Phone" "Phone" "Phone" ...
## $ Complaint Filed Place : chr [1:204397] "CCRB" "CCRB" "CCRB" "CCRB" ...
## $ Complaint Contains Stop & Frisk Allegations: logi [1:204397] FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ Incident Location : chr [1:204397] "Street/highway" "Street/highway" "Street/highway" "Street/highway" ...
## $ Incident Year : num [1:204397] 2005 2004 2004 2004 2004 ...
## $ Encounter Outcome : chr [1:204397] "No Arrest or Summons" "Arrest" "Arrest" "Arrest" ...
## $ Reason For Initial Contact : chr [1:204397] "Other" "PD suspected C/V of violation/crime - street" "PD suspected C/V of violation/crime - street" "PD suspected C/V of violation/crime - street" ...
## $ Allegation FADO Type : chr [1:204397] "Abuse of Authority" "Abuse of Authority" "Discourtesy" "Discourtesy" ...
## $ Allegation Description : chr [1:204397] "Threat of arrest" "Refusal to obtain medical treatment" "Word" "Word" ...
summary(ccrb)
## DateStamp UniqueComplaintId Close Year Received Year
## Min. :2016-11-29 Min. : 1 Min. :2006 Min. :1999
## 1st Qu.:2016-11-29 1st Qu.:17356 1st Qu.:2008 1st Qu.:2007
## Median :2016-11-29 Median :34794 Median :2010 Median :2009
## Mean :2016-11-29 Mean :34778 Mean :2010 Mean :2010
## 3rd Qu.:2016-11-29 3rd Qu.:52204 3rd Qu.:2013 3rd Qu.:2012
## Max. :2016-11-29 Max. :69492 Max. :2016 Max. :2016
## Borough of Occurrence Is Full Investigation Complaint Has Video Evidence
## Length:204397 Mode :logical Mode :logical
## Class :character FALSE:107084 FALSE:195530
## Mode :character TRUE :97313 TRUE :8867
##
##
##
## Complaint Filed Mode Complaint Filed Place
## Length:204397 Length:204397
## Class :character Class :character
## Mode :character Mode :character
##
##
##
## Complaint Contains Stop & Frisk Allegations Incident Location Incident Year
## Mode :logical Length:204397 Min. :1999
## FALSE:119856 Class :character 1st Qu.:2007
## TRUE :84541 Mode :character Median :2009
## Mean :2010
## 3rd Qu.:2012
## Max. :2016
## Encounter Outcome Reason For Initial Contact Allegation FADO Type
## Length:204397 Length:204397 Length:204397
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## Allegation Description
## Length:204397
## Class :character
## Mode :character
##
##
##
Firstly, let us look at the data by simple frequncy plots to understand the trend of how the complaints are registered by different factors.
h1=ggplot(data = ccrb, mapping = aes(x = ccrb$`Complaint Filed Mode`)) +
geom_bar(stat = "count") +
labs(x = "Complaints Filed Mode") +
ggtitle("Frequency of Complaints Filed") +
theme(plot.title = element_text(hjust = 0.5))
h2=ggplot(data = ccrb, mapping = aes(x = ccrb$`Allegation FADO Type`)) +
geom_bar(stat = "count") +
labs(x = "Allegation FADO type") +
ggtitle("Frequency of Allegations FADO type") +
theme(plot.title = element_text(hjust = 0.5))
h3=ggplot(data = ccrb, mapping = aes(x = ccrb$`Borough of Occurrence`)) +
geom_bar(stat = "count") +
labs(x = "Complaints Filed Mode") +
ggtitle("Frequency of Complaints occurred at particular Borough") +
theme(plot.title = element_text(hjust = 0.5))
h4=ggplot(data = ccrb, mapping = aes(x = ccrb$`Complaint Has Video Evidence`)) +
geom_bar(stat = "count") +
labs(x = "Video Evidence") +
ggtitle("Frequency of Complaints has video evidence") +
theme(plot.title = element_text(hjust = 0.5))
h5=ggplot(data = ccrb, mapping = aes(x = ccrb$`Incident Location`)) +
geom_bar(stat = "count") +
labs(x = "Incident Location") +
ggtitle("Frequency of Incident Location") +
theme(plot.title = element_text(hjust = 0.5))
h6=ggplot(data = ccrb, mapping = aes(x = ccrb$`Encounter Outcome`)) +
geom_bar(stat = "count") +
labs(x = "Encounter Outcome") +
ggtitle("Frequency of Complaint's Encounter Outcome") +
theme(plot.title = element_text(hjust = 0.5))
h1
h2
h3
h4
h5
h6
From the above plots, we can clearly interpret that there are more of Complaints filed from Phones, Call Processing System, and On-line website In-person but less from Fax and Mail. There are more Complaints registered in the area of Brooklyn, Manhattan and Bronx in comparison to other Boroughs of NY. There are very few Complaints with Video evidences. There are more cases involving with Abuse of authority, Force and Discourtesy.There are more of Arrests, No arrests or Summons and Summons for the Encounter Outcome.
Next, let us dive into deep visualizations from sound understanding of the multi-variate activity by using ‘fill’.
ggplot(ccrb, aes(x=ccrb$`Borough of Occurrence`, fill=ccrb$`Allegation FADO Type`)) +
geom_histogram(stat="count") +
labs(title="Histogram of Incident Occurence by Borough and Type", x="Borough", y="Frequency of Incident Occurence") +
guides(fill=guide_legend(title = "Type of allegation"))
In the above plot, we can clearly interpret the ranking of Incidents occurred by Boroughs wherein Brooklyn ranks first with more of Abuse of authority type of allegation cases while others like Bronx and Manhattan follow next with the same pattern with respect to type of allegations.
ggplot(ccrb, aes(x=ccrb$`Borough of Occurrence`, fill=ccrb$`Complaint Has Video Evidence`)) +
geom_histogram(stat="count") +
labs(title="Histogram of Complaints with video evidence by Boroughs", x="Boroughs", y="Frequency of Complaints") +
guides(fill=guide_legend(title = "Video Evidence"))
In the above plot, we can clearly visualize that there are very few Complaints which the Boroughs have video evidences.
ggplot(ccrb, aes(x=ccrb$`Complaint Filed Mode`, fill=ccrb$`Is Full Investigation`)) +
geom_histogram(stat="count") +
labs(title="Histogram of Complaints filed mode with full Investigations", x="Complaints filed mode", y="Frequency of Investigations")+
guides(fill=guide_legend(title = "Full Investigation"))
In the above plot, we can clearly see that most of the Complaints filed are by phone, call processing system, on-line website and In-person. Wherein less than 50% cases are performed full investigations if the complaints are filed by phone, call-processing system and On-line websites. However, if the complaints are filed by In-person or by Mail then they are performed with more than 60% cases.
p1=ggplot(ccrb, aes(x=ccrb$`Incident Year`, fill=ccrb$`Complaint Has Video Evidence`)) +
geom_histogram(stat="count") +
labs(title="Histogram of Incident Year for Complaints with Video Evidence ", x="year", y="Frequency") +
guides(fill=guide_legend(title = "Video Evidence"))
p2=ggplot(ccrb, aes(x=ccrb$`Incident Location`, fill=ccrb$`Complaint Has Video Evidence`)) +
geom_histogram(stat="count") +
labs(title="Histogram of Incident Location for Complaints with Video Evidence ", x="Incident Location", y="Frequency") +
guides(fill=guide_legend(title = "Video Evidence"))
p1
p2
In the above plot, we can clearly visualize that there are very few Boroughs with video evidences and moreover they are monitored after 2010.
ggplot(ccrb, aes(x=ccrb$`Close Year`, fill=ccrb$`Is Full Investigation`)) +
geom_histogram(stat="count") +
labs(title="Histogram of Investigations closed by Year", x="Year Closed", y="Frequency") +
guides(fill=guide_legend(title = "Full Investigation"))
In the above plot, we can clearly see that the trend is similar for most of the full investigations for the year of cases closed.
ggplot(ccrb, aes(x=ccrb$`Encounter Outcome`, fill=ccrb$`Borough of Occurrence`)) +
geom_histogram(stat="count") +
labs(title="Histogram of Encounter Outcome with Borough Occurred ", x="Encounter Outcome", y="Frequency") +
guides(fill=guide_legend(title = "Boroughs"))
In the above plot, we can clearly visualize that the different Encounter outcomes with respect with the Boroughs which show important inferences how which Boroughs are contributing for most of the cast Encounter outcomes.
ggplot(ccrb, aes(x=ccrb$`Incident Year`, fill=ccrb$`Encounter Outcome`)) +
geom_histogram(stat="count") +
labs(title="Histogram of Incident year Occurred by encounter outcome", x="Incident year", y="Frequency") +
guides(fill=guide_legend(title = "Encounter Outcome"))
In the above plot, we can clearly visualize how the Encounter Outcomes have been in between 2005 and 2015.
ggplot(ccrb, mapping = aes(ccrb$`Incident Year`,ccrb$`Received Year`)) +
geom_point()+
ggtitle("Relationship between Incident year and Received year") +
xlab("Incident year")+
ylab("Received year")+
geom_smooth(method = 'lm', color = 'blue')+
theme_minimal()
ggplot(ccrb, mapping = aes(ccrb$`Close Year`,ccrb$`Received Year`)) +
geom_point()+
ggtitle("Relationship between Incident closed year and Received year") +
xlab("Closed year")+
ylab("Received year")+
geom_smooth(method = 'lm', color = 'blue')+
theme_minimal()
In the above two plots, we have performed scatterplots with linear method wherein we could see the linear trend for both relationships of Closed year and Received year alongside Incident year and Received year.
Summary:
To summarize, the type of data obtained from the databases of complaints that fall within the Civilian Complain Review Board (CCRB) from NYC Data Transparency is Categorical type which is of majority. If we directly look into the xlsx file of the dataset then we will have to ponder for any clues for what is happening with the data. However, only after performing visualizations to the dataset’s variable we could see the trends and could clearly interpret the data. Hence, we can clearly conclude that the key to Data analysis is Data Vis.