Use the head and str functions for basic exploratory data analysis (EDA), then remove duplicate rows with the unique function.
library(ggplot2)
library(ggthemes)
library(quantmod)
## Loading required package: xts
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: TTR
## Version 0.4-0 included new data defaults. See ?getSymbols.
library(ggalt)
library(dygraphs)
library(xts)
library(plyr)
library(tidyverse)
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## arrange(): dplyr, plyr
## compact(): purrr, plyr
## count(): dplyr, plyr
## failwith(): dplyr, plyr
## filter(): dplyr, stats
## first(): dplyr, xts
## id(): dplyr, plyr
## lag(): dplyr, stats
## last(): dplyr, xts
## mutate(): dplyr, plyr
## rename(): dplyr, plyr
## summarise(): dplyr, plyr
## summarize(): dplyr, plyr
ccrb <- read.csv("C:/Users/mks/Google Drive/HU/ANLY512-50/ccrb.csv")
head(ccrb)
## DateStamp UniqueComplaintId Close.Year Received.Year
## 1 11/29/2016 11 2006 2005
## 2 11/29/2016 18 2006 2004
## 3 11/29/2016 18 2006 2004
## 4 11/29/2016 18 2006 2004
## 5 11/29/2016 18 2006 2004
## 6 11/29/2016 18 2006 2004
## Borough.of.Occurrence Is.Full.Investigation Complaint.Has.Video.Evidence
## 1 Manhattan FALSE FALSE
## 2 Brooklyn TRUE FALSE
## 3 Brooklyn TRUE FALSE
## 4 Brooklyn TRUE FALSE
## 5 Brooklyn TRUE FALSE
## 6 Brooklyn TRUE FALSE
## Complaint.Filed.Mode Complaint.Filed.Place
## 1 On-line website CCRB
## 2 Phone CCRB
## 3 Phone CCRB
## 4 Phone CCRB
## 5 Phone CCRB
## 6 Phone CCRB
## Complaint.Contains.Stop...Frisk.Allegations Incident.Location
## 1 FALSE Street/highway
## 2 FALSE Street/highway
## 3 FALSE Street/highway
## 4 FALSE Street/highway
## 5 FALSE Street/highway
## 6 FALSE Street/highway
## Incident.Year Encounter.Outcome
## 1 2005 No Arrest or Summons
## 2 2004 Arrest
## 3 2004 Arrest
## 4 2004 Arrest
## 5 2004 Arrest
## 6 2004 Arrest
## Reason.For.Initial.Contact Allegation.FADO.Type
## 1 Other Abuse of Authority
## 2 PD suspected C/V of violation/crime - street Abuse of Authority
## 3 PD suspected C/V of violation/crime - street Discourtesy
## 4 PD suspected C/V of violation/crime - street Discourtesy
## 5 PD suspected C/V of violation/crime - street Discourtesy
## 6 PD suspected C/V of violation/crime - street Force
## Allegation.Description
## 1 Threat of arrest
## 2 Refusal to obtain medical treatment
## 3 Word
## 4 Word
## 5 Word
## 6 Physical force
str(ccrb)
## 'data.frame': 204397 obs. of 16 variables:
## $ DateStamp : Factor w/ 1 level "11/29/2016": 1 1 1 1 1 1 1 1 1 1 ...
## $ UniqueComplaintId : int 11 18 18 18 18 18 18 18 18 18 ...
## $ Close.Year : int 2006 2006 2006 2006 2006 2006 2006 2006 2006 2006 ...
## $ Received.Year : int 2005 2004 2004 2004 2004 2004 2004 2004 2004 2004 ...
## $ Borough.of.Occurrence : Factor w/ 6 levels "Bronx","Brooklyn",..: 3 2 2 2 2 2 2 2 2 2 ...
## $ Is.Full.Investigation : logi FALSE TRUE TRUE TRUE TRUE TRUE ...
## $ Complaint.Has.Video.Evidence : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ Complaint.Filed.Mode : Factor w/ 7 levels "Call Processing System",..: 6 7 7 7 7 7 7 7 7 7 ...
## $ Complaint.Filed.Place : Factor w/ 14 levels "CCRB","Comm. to Combat Police Corruption",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Complaint.Contains.Stop...Frisk.Allegations: logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ Incident.Location : Factor w/ 15 levels "Apartment/house",..: 14 14 14 14 14 14 14 14 14 14 ...
## $ Incident.Year : int 2005 2004 2004 2004 2004 2004 2004 2004 2004 2004 ...
## $ Encounter.Outcome : Factor w/ 4 levels "Arrest","No Arrest or Summons",..: 2 1 1 1 1 1 1 1 1 1 ...
## $ Reason.For.Initial.Contact : Factor w/ 49 levels "Aided case","Arrest/Complainant",..: 23 32 32 32 32 32 32 32 32 32 ...
## $ Allegation.FADO.Type : Factor w/ 4 levels "Abuse of Authority",..: 1 1 2 2 2 3 3 3 3 3 ...
## $ Allegation.Description : Factor w/ 56 levels "Action","Animal",..: 48 35 56 56 56 27 27 27 27 27 ...
ccrb <- unique(ccrb)
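Before dropping duplicates it can help to check how many rows would be removed. The following is a minimal sketch on a small made-up data frame; the `toy` object and its values are illustrative assumptions, not rows taken from the CCRB file.

```r
# Hypothetical toy data: rows 2 and 3 are exact duplicates of each other.
toy <- data.frame(
  UniqueComplaintId = c(11, 18, 18, 18),
  Allegation.FADO.Type = c("Abuse of Authority", "Discourtesy",
                           "Discourtesy", "Force"),
  stringsAsFactors = FALSE
)

# duplicated() flags every row that repeats an earlier row.
n_dup <- sum(duplicated(toy))

# unique() keeps the first occurrence of each distinct row.
toy_unique <- unique(toy)

n_dup
nrow(toy_unique)
```

Running the same two calls on `ccrb` before the `unique()` step would report how many of the 204397 rows are exact repeats.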
ggplot(ccrb, aes(x = Borough.of.Occurrence, fill = Is.Full.Investigation)) +
geom_bar(stat = 'count') +
labs(title = "Full Investigation True or False", x = "Location", y = "Count") +
theme_economist()
Bar plots are useful for finding the most frequent levels of a categorical variable. The following bar plots give us an understanding of the complaints in the CCRB data.
The Borough of Occurrence variable is arranged in descending order of frequency, which shows that Brooklyn has the largest number of incidents, followed by Bronx, Manhattan and so forth. The count for Outside NYC is negligible. This may be because people outside NYC are unlikely to file complaints with the CCRB, which is an NYC government organization.
library(tidyverse)
library(ggplot2)
library(forcats)
library(dplyr)
ggplot(ccrb, aes(x = fct_infreq(Borough.of.Occurrence))) +
geom_bar() + xlab("Borough.of.Occurrence")
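The ordering shown in the plot can also be checked numerically. This is a sketch using a short made-up vector in place of `ccrb$Borough.of.Occurrence`; the values are assumptions chosen only to show the mechanics.

```r
# Hypothetical borough labels standing in for ccrb$Borough.of.Occurrence.
borough <- factor(c("Brooklyn", "Bronx", "Brooklyn",
                    "Manhattan", "Brooklyn", "Bronx"))

# table() counts each level; sorting in decreasing order puts the most
# frequent level first, as fct_infreq() does when there are no ties.
borough_counts <- sort(table(borough), decreasing = TRUE)
borough_counts

# The same counts expressed as a percentage of all rows.
round(100 * prop.table(borough_counts), 1)
```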
ggplot(ccrb, aes(x = Borough.of.Occurrence, fill = Complaint.Has.Video.Evidence)) +
geom_bar(stat = 'count') +
labs(title = "Complaints have video evidence or not", x = "Location", y = "Count") +
theme_economist()
### Distribution of Incident Locations
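# One row per (complaint, location) pair, so repeated allegation rows are not double counted
p2data<-unique(data.frame(ccrb$UniqueComplaintId,ccrb$Incident.Location))
p2<-table(p2data$ccrb.Incident.Location)
# Convert counts to percentages and sort from most to least frequent
p2<-p2/sum(p2)*100
p2<-p2[order(-p2)]
barplot(p2,ylab="Count (%)",ylim=c(0,60),las=2,main="Count of Incident Locations")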
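# Allegation.FADO.Type is part of the key, so a complaint is counted once per allegation type
p3data<-unique(data.frame(ccrb$UniqueComplaintId,ccrb$Complaint.Filed.Mode,ccrb$Allegation.FADO.Type))
p3<-table(p3data$ccrb.Complaint.Filed.Mode)
# Convert counts to percentages and sort from most to least frequent
p3<-p3/sum(p3)*100
p3<-p3[order(-p3)]
barplot(p3,xlab="Communication Method",ylab="Count (%)",ylim=c(0,100))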
Complaint filing modes are arranged in descending order of frequency, which shows that most complaints are received by Phone, followed by Call Processing System and On-line website. The least popular mode of filing complaints is fax, followed by email. A plausible real-world interpretation is that people with an urgent matter generally call the respective organization and, if they cannot reach anyone, leave a message. This would explain the two most popular modes. Fax, the least popular mode, is no longer used by many people, and email is not the mode one would choose for an immediate response. Mail, however, is more popular than email, possibly because people prefer a written hard copy of a filed complaint as proof.
ccrb %>% mutate(Complaint.Filed.Mode = Complaint.Filed.Mode %>% fct_infreq() %>% fct_rev()) %>%
ggplot(aes(Complaint.Filed.Mode)) +
geom_bar() + coord_flip() + xlab("Complaint.Filed.Mode")
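One point worth keeping in mind when comparing the plots: the base-R bar plots deduplicate on the complaint ID (together with selected columns) before counting, whereas `geom_bar()` counts rows, and a single complaint can span several allegation rows. The sketch below illustrates the difference on made-up data; the `toy` data frame is a hypothetical stand-in for `ccrb`.

```r
# Hypothetical data: complaint 18 has three allegation rows, complaint 11 has one.
toy <- data.frame(
  UniqueComplaintId = c(11, 18, 18, 18),
  Complaint.Filed.Mode = c("On-line website", "Phone", "Phone", "Phone"),
  stringsAsFactors = FALSE
)

# Counting rows: every allegation row adds to its filing mode.
by_row <- table(toy$Complaint.Filed.Mode)

# Counting complaints: keep the first row of each complaint ID, then count.
one_per_complaint <- toy[!duplicated(toy$UniqueComplaintId), ]
by_complaint <- table(one_per_complaint$Complaint.Filed.Mode)

by_row
by_complaint
```

Which of the two counts is appropriate depends on whether the question is about allegations or about complaints.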
The encounter outcomes are arranged in descending order of frequency, which reveals a surprising finding: No Arrest or Summons has a somewhat higher frequency than Arrest.
ggplot(ccrb, aes(x = fct_infreq(Encounter.Outcome))) +
geom_bar() + xlab("Encounter.Outcome")
The allegation types are arranged in descending order of frequency, which shows that Abuse of Authority is the most common allegation type, followed far behind by Force.
ggplot(ccrb, aes(x = fct_infreq(Allegation.FADO.Type))) +
geom_bar() + xlab("Allegation.FADO.Type")
When the incident locations are arranged in descending order of frequency, we see that the most common location for incidents is Street/highway. The surprising finding, however, is that it is followed by Apartment/house. Although there is a large gap between the counts for Street/highway and Apartment/house, it is still alarming that more incidents occur in apartments and houses than in public buildings and trains.
ggplot(ccrb, aes(x = fct_infreq(Incident.Location))) +
geom_bar() + xlab("Incident.Location") + coord_flip()
Exploratory data analysis is very useful on data sets we are not familiar with. We can check the relationship between any variables as required. The outcome is quite clear and easy to understand:
Force allegations show a higher proportion of arrests, whereas Abuse of Authority has a higher proportion of ‘No Arrest or Summons’.
Most incidents occur on streets/highways, followed far behind by apartments/houses. However, incidents are more frequent in apartments/houses than in public buildings, trains, etc.
Manhattan has a strong association with all allegations except Abuse of Authority, where there is a strong negative association.
Brooklyn, followed by Bronx, has the largest number of reported incidents, whereas Staten Island and Outside NYC have the fewest.
Brooklyn also has the largest number of investigations that were not ‘full’.
Most complaints received do not have video evidence.
Abuse of Authority and Force are the most common allegations.
Complaints are mainly filed by Phone and through the Call Processing System.
Abuse of Authority and Force allegations include cases that have run longer than Offensive Language cases.
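Findings that relate two categorical variables, such as allegation type and encounter outcome, can be checked with a cross-tabulation. The following is a sketch on made-up counts, assuming row-wise proportions are the quantity of interest; the `toy` data frame and its values are illustrative and do not reproduce the CCRB results.

```r
# Hypothetical data, not drawn from the CCRB file.
toy <- data.frame(
  Allegation.FADO.Type = c("Force", "Force", "Force",
                           "Abuse of Authority", "Abuse of Authority",
                           "Abuse of Authority"),
  Encounter.Outcome = c("Arrest", "Arrest", "No Arrest or Summons",
                        "No Arrest or Summons", "No Arrest or Summons",
                        "Arrest"),
  stringsAsFactors = FALSE
)

# Rows are allegation types, columns are encounter outcomes.
tab <- table(toy$Allegation.FADO.Type, toy$Encounter.Outcome)

# margin = 1 divides each row by its row total, so every row sums to 1.
row_share <- prop.table(tab, margin = 1)
round(row_share, 2)
```

Applying the same two calls to `ccrb$Allegation.FADO.Type` and `ccrb$Encounter.Outcome` would give the outcome shares within each allegation type.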