Basic Exploratory Data Analysis

Use the head and str commands for a first look at the data, then remove duplicate rows with the unique command.

library(ggplot2)
library(ggthemes)
library(quantmod)
## Loading required package: xts
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: TTR
## Version 0.4-0 included new data defaults. See ?getSymbols.
library(ggalt)
library(dygraphs)
library(xts)
library(plyr)
library(tidyverse)
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## arrange():   dplyr, plyr
## compact():   purrr, plyr
## count():     dplyr, plyr
## failwith():  dplyr, plyr
## filter():    dplyr, stats
## first():     dplyr, xts
## id():        dplyr, plyr
## lag():       dplyr, stats
## last():      dplyr, xts
## mutate():    dplyr, plyr
## rename():    dplyr, plyr
## summarise(): dplyr, plyr
## summarize(): dplyr, plyr
ccrb <- read.csv("C:/Users/mks/Google Drive/HU/ANLY512-50/ccrb.csv")
head(ccrb)
##    DateStamp UniqueComplaintId Close.Year Received.Year
## 1 11/29/2016                11       2006          2005
## 2 11/29/2016                18       2006          2004
## 3 11/29/2016                18       2006          2004
## 4 11/29/2016                18       2006          2004
## 5 11/29/2016                18       2006          2004
## 6 11/29/2016                18       2006          2004
##   Borough.of.Occurrence Is.Full.Investigation Complaint.Has.Video.Evidence
## 1             Manhattan                 FALSE                        FALSE
## 2              Brooklyn                  TRUE                        FALSE
## 3              Brooklyn                  TRUE                        FALSE
## 4              Brooklyn                  TRUE                        FALSE
## 5              Brooklyn                  TRUE                        FALSE
## 6              Brooklyn                  TRUE                        FALSE
##   Complaint.Filed.Mode Complaint.Filed.Place
## 1      On-line website                  CCRB
## 2                Phone                  CCRB
## 3                Phone                  CCRB
## 4                Phone                  CCRB
## 5                Phone                  CCRB
## 6                Phone                  CCRB
##   Complaint.Contains.Stop...Frisk.Allegations Incident.Location
## 1                                       FALSE    Street/highway
## 2                                       FALSE    Street/highway
## 3                                       FALSE    Street/highway
## 4                                       FALSE    Street/highway
## 5                                       FALSE    Street/highway
## 6                                       FALSE    Street/highway
##   Incident.Year    Encounter.Outcome
## 1          2005 No Arrest or Summons
## 2          2004               Arrest
## 3          2004               Arrest
## 4          2004               Arrest
## 5          2004               Arrest
## 6          2004               Arrest
##                     Reason.For.Initial.Contact Allegation.FADO.Type
## 1                                        Other   Abuse of Authority
## 2 PD suspected C/V of violation/crime - street   Abuse of Authority
## 3 PD suspected C/V of violation/crime - street          Discourtesy
## 4 PD suspected C/V of violation/crime - street          Discourtesy
## 5 PD suspected C/V of violation/crime - street          Discourtesy
## 6 PD suspected C/V of violation/crime - street                Force
##                Allegation.Description
## 1                    Threat of arrest
## 2 Refusal to obtain medical treatment
## 3                                Word
## 4                                Word
## 5                                Word
## 6                      Physical force
str(ccrb)
## 'data.frame':    204397 obs. of  16 variables:
##  $ DateStamp                                  : Factor w/ 1 level "11/29/2016": 1 1 1 1 1 1 1 1 1 1 ...
##  $ UniqueComplaintId                          : int  11 18 18 18 18 18 18 18 18 18 ...
##  $ Close.Year                                 : int  2006 2006 2006 2006 2006 2006 2006 2006 2006 2006 ...
##  $ Received.Year                              : int  2005 2004 2004 2004 2004 2004 2004 2004 2004 2004 ...
##  $ Borough.of.Occurrence                      : Factor w/ 6 levels "Bronx","Brooklyn",..: 3 2 2 2 2 2 2 2 2 2 ...
##  $ Is.Full.Investigation                      : logi  FALSE TRUE TRUE TRUE TRUE TRUE ...
##  $ Complaint.Has.Video.Evidence               : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Complaint.Filed.Mode                       : Factor w/ 7 levels "Call Processing System",..: 6 7 7 7 7 7 7 7 7 7 ...
##  $ Complaint.Filed.Place                      : Factor w/ 14 levels "CCRB","Comm. to Combat Police Corruption",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Complaint.Contains.Stop...Frisk.Allegations: logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Incident.Location                          : Factor w/ 15 levels "Apartment/house",..: 14 14 14 14 14 14 14 14 14 14 ...
##  $ Incident.Year                              : int  2005 2004 2004 2004 2004 2004 2004 2004 2004 2004 ...
##  $ Encounter.Outcome                          : Factor w/ 4 levels "Arrest","No Arrest or Summons",..: 2 1 1 1 1 1 1 1 1 1 ...
##  $ Reason.For.Initial.Contact                 : Factor w/ 49 levels "Aided case","Arrest/Complainant",..: 23 32 32 32 32 32 32 32 32 32 ...
##  $ Allegation.FADO.Type                       : Factor w/ 4 levels "Abuse of Authority",..: 1 1 2 2 2 3 3 3 3 3 ...
##  $ Allegation.Description                     : Factor w/ 56 levels "Action","Animal",..: 48 35 56 56 56 27 27 27 27 27 ...
ccrb <- unique(ccrb)
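
As a minimal sketch of what unique() does to a data frame, the toy example below drops rows that are identical in every column and reports how many were removed. The data frame `toy` and its values are made up for illustration and are not taken from ccrb.csv. In the head() output above, rows 3 to 5 are identical in every column, so unique() would keep only one of them.

```r
# Toy data (hypothetical values): rows 2 and 3 are exact duplicates.
toy <- data.frame(
  UniqueComplaintId = c(11, 18, 18, 18),
  Allegation.FADO.Type = c("Abuse of Authority", "Discourtesy",
                           "Discourtesy", "Force"),
  stringsAsFactors = FALSE
)

n_before <- nrow(toy)
toy_dedup <- unique(toy)                 # keeps the first copy of each distinct row
n_removed <- n_before - nrow(toy_dedup)  # number of rows that were dropped

# duplicated() flags exactly the rows that unique() drops
dup_flags <- duplicated(toy)

print(n_removed)
print(dup_flags)
```

Comparing nrow() before and after the call is a cheap way to see how much of the raw file consisted of repeated rows.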

Graph showing the distribution of full investigations by borough

ggplot(ccrb, aes(x = Borough.of.Occurrence, fill = Is.Full.Investigation)) + geom_bar(stat = 'count') + labs(title = "Full Investigation True or False", x = "Location", y = "Count") + theme_economist()

Bar Plots - Univariate Frequency analysis

Bar plots are useful for finding the most frequent levels of a categorical variable. The following bar plots give us an understanding of the complaints filed with the CCRB.

Borough.of.Occurrence

The Borough.of.Occurrence variable is arranged in descending order of frequency, which shows that Brooklyn has the largest number of incidents, followed by the Bronx, Manhattan and so forth. The count for Outside NYC is negligible, possibly because people outside NYC are unlikely to file complaints with the CCRB, which is an NYC government organization.

library(tidyverse)
library(ggplot2)
library(forcats)
library(dplyr)
ggplot(ccrb, aes(x = fct_infreq(Borough.of.Occurrence))) +
  geom_bar() + xlab("Borough.of.Occurrence")
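
The frequency ordering used in the plot above can also be checked without drawing anything. The following is a base-R sketch of the same idea as fct_infreq(), not the forcats implementation itself. The vector `borough` and its counts are hypothetical and do not reflect the real CCRB frequencies.

```r
# Toy borough labels (hypothetical counts, not the real CCRB frequencies)
borough <- factor(c("Bronx", "Brooklyn", "Brooklyn", "Brooklyn",
                    "Manhattan", "Manhattan"))

# Default factor levels are alphabetical, which is the order geom_bar() would use
default_levels <- levels(borough)

# Reorder the levels by descending frequency
counts <- sort(table(borough), decreasing = TRUE)
by_freq <- factor(borough, levels = names(counts))

print(default_levels)
print(levels(by_freq))
```

Because ggplot2 draws one bar per factor level in level order, changing the level order is enough to change the bar order.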

Distribution of complaints with video evidence

ggplot(ccrb, aes(x = Borough.of.Occurrence, fill = Complaint.Has.Video.Evidence)) + geom_bar(stat = 'count') + labs(title = "Complaints have video evidence or not", x = "Location", y = "Count") + theme_economist()

Distribution of Incident Locations

p2data<-unique(data.frame(ccrb$UniqueComplaintId,ccrb$Incident.Location))
p2<-table(p2data$ccrb.Incident.Location)
p2<-p2/sum(p2)*100
p2<-p2[order(-p2)]
barplot(p2,ylab="Count (%)",ylim=c(0,60),las=2,main="Count of Incident Locations")
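
The str() output suggests that each row of ccrb is one allegation, so a complaint with several allegations appears on several rows. The code above therefore reduces the data to one row per complaint and location before tabulating. A minimal sketch of why this matters, using a made-up data frame `toy` (hypothetical values); prop.table() is used here as an equivalent of dividing the table by its sum:

```r
# Toy allegation-level data (hypothetical): complaint 18 has three allegations
toy <- data.frame(
  UniqueComplaintId = c(11, 18, 18, 18, 25),
  Incident.Location = c("Apartment/house", "Street/highway", "Street/highway",
                        "Street/highway", "Apartment/house"),
  stringsAsFactors = FALSE
)

# Share per allegation row: complaint 18 is counted three times
share_rows <- prop.table(table(toy$Incident.Location)) * 100

# Share per complaint: keep one row per (complaint, location) pair first
per_complaint <- unique(toy[, c("UniqueComplaintId", "Incident.Location")])
share_complaints <- prop.table(table(per_complaint$Incident.Location)) * 100

print(share_rows)
print(share_complaints)
```

In this toy case Street/highway has the larger share of allegation rows, while Apartment/house has the larger share of complaints, so the choice of unit changes the picture.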

Distribution of Communication Method

p3data<-unique(data.frame(ccrb$UniqueComplaintId,ccrb$Complaint.Filed.Mode,ccrb$Allegation.FADO.Type))
p3<-table(p3data$ccrb.Complaint.Filed.Mode)
p3<-p3/sum(p3)*100
p3<-p3[order(-p3)]
barplot(p3,xlab="Communication Method",ylab="Count (%)",ylim=c(0,100))

Complaint Filing Modes

Complaint filing modes are arranged in descending order of frequency, which shows that most complaints are received by Phone, followed by Call Processing System and On-line website. The least popular mode is fax, followed by email. One plausible real-world interpretation is that when a matter is urgent we generally call the relevant organization and, if we cannot reach anyone, leave a message, which would explain the two most popular modes. Fax is no longer used by many people, and email is not the mode you would choose for an immediate response. Mail, however, is more popular than email, possibly because people prefer to have a receipt for a written hard copy of a filed complaint as proof.

ccrb %>% mutate(Complaint.Filed.Mode = Complaint.Filed.Mode %>% fct_infreq() %>% fct_rev()) %>% 
ggplot(aes(Complaint.Filed.Mode)) +
geom_bar() + coord_flip() + xlab("Complaint.Filed.Mode")

Encounter Outcome

The encounter outcomes are arranged in descending order of frequency, which reveals a surprising finding: No Arrest or Summons has a somewhat higher frequency than Arrest.

ggplot(ccrb, aes(x = fct_infreq(Encounter.Outcome))) +
  geom_bar() + xlab("Encounter.Outcome")

Classification of Allegations

The allegation types are arranged in descending order of frequency, which shows that Abuse of Authority is by far the most common allegation type, with Force a distant second.

ggplot(ccrb, aes(x = fct_infreq(Allegation.FADO.Type))) +
  geom_bar() + xlab("Allegation.FADO.Type")

Incident Location

When the incident locations are arranged in descending order of frequency, we see that the most common location for incidents is Street/highway. The surprising finding, however, is that it is followed by Apartment/house. Although there is a huge difference between the counts for Street/highway and Apartment/house, it is still scary to know that more incidents occur in apartments and houses than in public buildings and trains.

ggplot(ccrb, aes(x = fct_infreq(Incident.Location))) +
  geom_bar() + xlab("Incident.Location") + coord_flip()

Share of Allegation Type by Complaint Filed Place

p10<-unique(data.frame(ccrb$UniqueComplaintId,ccrb$Complaint.Filed.Place,ccrb$Allegation.FADO.Type))
p10.pic<-p10 %>%
        group_by(ccrb.Allegation.FADO.Type,ccrb.Complaint.Filed.Place) %>%
        summarise(count=n()) %>%
        mutate(percent=count/sum(count))

p100<-ggplot(p10.pic,aes(ccrb.Allegation.FADO.Type,y=percent,fill=ccrb.Complaint.Filed.Place))+
    geom_bar(stat="identity",width=0.5)+ scale_fill_discrete(name = "Share of Complaint.Filed.Place")+
    ggtitle("Share of Allegation Type by Complaint Filed Place")+
    theme(legend.position = "bottom")+
    xlab("Allegation Type")+
    theme(plot.title = element_text(hjust = 0.5))
p100
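
In the pipeline above, summarise() removes the last grouping variable, so the following mutate() computes each share within an allegation type rather than over the whole table. The sketch below illustrates this on made-up data. The columns `type` and `place` and their values are hypothetical, and the `.groups` argument assumes dplyr 1.0 or later, which is newer than the version that appears to have been used for this report.

```r
library(dplyr)

# Toy data (hypothetical): one row per complaint / place / type combination
toy <- data.frame(
  type  = c("Force", "Force", "Force", "Discourtesy", "Discourtesy"),
  place = c("CCRB", "CCRB", "Other", "CCRB", "Other"),
  stringsAsFactors = FALSE
)

shares <- toy %>%
  group_by(type, place) %>%
  # "drop_last" spells out the default: the result stays grouped by type only
  summarise(count = n(), .groups = "drop_last") %>%
  # sum(count) is therefore the total within each allegation type
  mutate(percent = count / sum(count)) %>%
  ungroup()

# The shares should add up to 1 within each allegation type
totals <- tapply(shares$percent, shares$type, sum)

print(shares)
print(totals)
```

Checking that the shares sum to 1 per type is a quick way to confirm that every stacked bar in the plot reaches the same height.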

Summary

Exploratory data analysis is very useful for data sets we are not familiar with. We can check the relationship between any variables of interest as required, and the results are clear and easy to understand.