Objectives

The objective of this assignment is to conduct an exploratory data analysis of a data set that you are not familiar with. In this week’s lecture, we discussed a number of visualization approaches for exploring a data set. This assignment applies those tools and techniques. An important distinction between class examples and applied data science work is the iterative and repetitive nature of exploring a data set. It takes time to understand what the data is and what patterns in it are interesting.

For this week, we will be exploring data from the NYC Data Transparency Initiative. They maintain a database of complaints that fall within the jurisdiction of the Civilian Complaint Review Board (CCRB), an independent municipal agency. Your objective is to identify interesting patterns within the data that may be indicative of large-scale trends.

This link will allow you to download the data set in .xlsx format. The data file has two tabs: one with metadata, and the “Complaints_Allegations” tab with the actual data.

Deliverable and Grades

For this assignment, you should submit a link to a knitr rendered html document that shows your exploratory data analysis. Organize your analysis using section headings:

Data Upload

library(readxl)
library(ggplot2)  # needed for the ggplot() calls in the sections below
ccrb <- read_excel("C:/Users/mruga/Desktop/ccrb.xlsx")
summary(ccrb)
##    DateStamp          UniqueComplaintId   Close Year   Received Year 
##  Min.   :2016-11-29   Min.   :    1     Min.   :2006   Min.   :1999  
##  1st Qu.:2016-11-29   1st Qu.:17356     1st Qu.:2008   1st Qu.:2007  
##  Median :2016-11-29   Median :34794     Median :2010   Median :2009  
##  Mean   :2016-11-29   Mean   :34778     Mean   :2010   Mean   :2010  
##  3rd Qu.:2016-11-29   3rd Qu.:52204     3rd Qu.:2013   3rd Qu.:2012  
##  Max.   :2016-11-29   Max.   :69492     Max.   :2016   Max.   :2016  
##  Borough of Occurrence Is Full Investigation Complaint Has Video Evidence
##  Length:204397         Mode :logical         Mode :logical               
##  Class :character      FALSE:107084          FALSE:195530                
##  Mode  :character      TRUE :97313           TRUE :8867                  
##                                                                          
##                                                                          
##                                                                          
##  Complaint Filed Mode Complaint Filed Place
##  Length:204397        Length:204397        
##  Class :character     Class :character     
##  Mode  :character     Mode  :character     
##                                            
##                                            
##                                            
##  Complaint Contains Stop & Frisk Allegations Incident Location  Incident Year 
##  Mode :logical                               Length:204397      Min.   :1999  
##  FALSE:119856                                Class :character   1st Qu.:2007  
##  TRUE :84541                                 Mode  :character   Median :2009  
##                                                                 Mean   :2010  
##                                                                 3rd Qu.:2012  
##                                                                 Max.   :2016  
##  Encounter Outcome  Reason For Initial Contact Allegation FADO Type
##  Length:204397      Length:204397              Length:204397       
##  Class :character   Class :character           Class :character    
##  Mode  :character   Mode  :character           Mode  :character    
##                                                                    
##                                                                    
##                                                                    
##  Allegation Description
##  Length:204397         
##  Class :character      
##  Mode  :character      
##                        
##                        
## 
str(ccrb)
## tibble [204,397 x 16] (S3: tbl_df/tbl/data.frame)
##  $ DateStamp                                  : POSIXct[1:204397], format: "2016-11-29" "2016-11-29" ...
##  $ UniqueComplaintId                          : num [1:204397] 11 18 18 18 18 18 18 18 18 18 ...
##  $ Close Year                                 : num [1:204397] 2006 2006 2006 2006 2006 ...
##  $ Received Year                              : num [1:204397] 2005 2004 2004 2004 2004 ...
##  $ Borough of Occurrence                      : chr [1:204397] "Manhattan" "Brooklyn" "Brooklyn" "Brooklyn" ...
##  $ Is Full Investigation                      : logi [1:204397] FALSE TRUE TRUE TRUE TRUE TRUE ...
##  $ Complaint Has Video Evidence               : logi [1:204397] FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Complaint Filed Mode                       : chr [1:204397] "On-line website" "Phone" "Phone" "Phone" ...
##  $ Complaint Filed Place                      : chr [1:204397] "CCRB" "CCRB" "CCRB" "CCRB" ...
##  $ Complaint Contains Stop & Frisk Allegations: logi [1:204397] FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Incident Location                          : chr [1:204397] "Street/highway" "Street/highway" "Street/highway" "Street/highway" ...
##  $ Incident Year                              : num [1:204397] 2005 2004 2004 2004 2004 ...
##  $ Encounter Outcome                          : chr [1:204397] "No Arrest or Summons" "Arrest" "Arrest" "Arrest" ...
##  $ Reason For Initial Contact                 : chr [1:204397] "Other" "PD suspected C/V of violation/crime - street" "PD suspected C/V of violation/crime - street" "PD suspected C/V of violation/crime - street" ...
##  $ Allegation FADO Type                       : chr [1:204397] "Abuse of Authority" "Abuse of Authority" "Discourtesy" "Discourtesy" ...
##  $ Allegation Description                     : chr [1:204397] "Threat of arrest" "Refusal to obtain medical treatment" "Word" "Word" ...
dim(ccrb)
## [1] 204397     16
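
For the character columns, summary() reports only length and class, so a per-column count of missing and distinct values can be a useful extra check at this stage. The sketch below runs on a small made-up data frame (`toy`, a hypothetical stand-in for `ccrb` with invented values) so that it is runnable on its own; the same two lines should apply to `ccrb` unchanged.

```r
# Hypothetical stand-in for ccrb; the values are invented for illustration.
toy <- data.frame(
  `Borough of Occurrence` = c("Brooklyn", "Bronx", NA, "Brooklyn"),
  `Complaint Filed Mode`  = c("Phone", "Phone", "E-mail", NA),
  `Received Year`         = c(2005, 2006, 2006, 2007),
  check.names = FALSE,        # keep the spaces in the column names
  stringsAsFactors = FALSE
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))

# Number of distinct non-missing values in each column
distinct_counts <- sapply(toy, function(col) length(unique(col[!is.na(col)])))

print(data.frame(missing = na_counts, distinct = distinct_counts))
```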

1) Complaints per year by allegation

This graph shows the trend in complaints over the years, split by allegation type.

ggplot(ccrb, aes(x = `Received Year`, fill = `Allegation FADO Type`)) +
  geom_bar() +  # geom_bar() counts rows per year; geom_histogram(stat = "count") warns about ignored parameters
  labs(title = "Number of Complaints Received Each Year", x = "Received Year", y = "Number of Complaints") +
  scale_fill_discrete(name = "Allegation Type")
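
The str() output above shows the same UniqueComplaintId on several rows, so each row appears to be one allegation rather than one complaint, and counting rows per year counts allegations. One way to count complaints instead is to keep a single row per complaint ID first. This is a minimal sketch on an invented data frame (`toy` is hypothetical, not the real data):

```r
# Hypothetical stand-in for ccrb; the values are invented for illustration.
toy <- data.frame(
  UniqueComplaintId = c(11, 18, 18, 18, 25),
  `Received Year`   = c(2005, 2004, 2004, 2004, 2005),
  check.names = FALSE         # keep the space in the column name
)

# Counting rows per year counts allegations
allegations_per_year <- table(toy$`Received Year`)

# Keep the first row for each complaint ID, then count complaints per year
complaints <- toy[!duplicated(toy$UniqueComplaintId), ]
complaints_per_year <- table(complaints$`Received Year`)

print(allegations_per_year)
print(complaints_per_year)
```

Keeping the first row per ID is enough for complaint-level columns such as the received year, but it discards the allegation-level columns, so it does not suit the split by allegation type.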

2) Complaints per borough

This graph tells us the total number of allegations in each borough. Here Brooklyn has the most.

borough = table(ccrb$`Borough of Occurrence`)
lbls = names(borough)
barplot(borough, 
        xlab = "Borough of Occurrence", 
        ylab = "Number", 
        main = "Borough of Occurrence in CCRB Report", 
        horiz = FALSE,
        legend.text = TRUE,
        col=rainbow(length(lbls)))
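
To back the bar chart with numbers, the same table can be sorted and converted to percentages. The sketch below uses an invented vector of borough labels (`toy_borough` is hypothetical). Note that table() drops NA values by default, so the shares are taken over the non-missing rows only.

```r
# Hypothetical borough labels; invented for illustration.
toy_borough <- c("Brooklyn", "Bronx", "Brooklyn", "Manhattan",
                 "Bronx", "Brooklyn", NA)

# Counts per borough, largest first (NA is excluded by table() by default)
borough_counts <- sort(table(toy_borough), decreasing = TRUE)

# Share of non-missing rows per borough, in percent
borough_share <- round(prop.table(borough_counts) * 100, 1)

print(borough_counts)
print(borough_share)
```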

3) Complaints by filing mode

This graph tells us which mode of communication is most preferred for filing a complaint. Unsurprisingly, phone is the most used.

ggplot(ccrb, aes(`Complaint Filed Mode`, fill = `Complaint Filed Mode`)) +
  geom_bar() +  # one bar per filing mode
  labs(title = "Mode of complaints filed", x = "Mode", y = "Total") +
  scale_fill_discrete(name = "Complaint Filed Mode")

4) Cases with or without Surveillance

This graph shows how many complaints have video evidence. As we see, very few do, and video evidence only appears from 2011 onward, which suggests that before then either cameras were scarce or video was rarely used as evidence.

ggplot(ccrb, aes(x = `Received Year`, fill = `Complaint Has Video Evidence`)) +
  geom_bar() +
  labs(title = "Number of complaints with video evidence")
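
Because the yearly totals differ, the share of rows with video evidence per year may be easier to compare than the raw counts. A minimal sketch, assuming a logical video column like the one in `ccrb`; the data frame `toy` below is hypothetical and its values are invented:

```r
# Hypothetical stand-in for ccrb; the values are invented for illustration.
toy <- data.frame(
  `Received Year` = c(2010, 2010, 2011, 2011, 2012, 2012, 2012, 2012),
  `Complaint Has Video Evidence` = c(FALSE, FALSE, FALSE, TRUE,
                                     TRUE, FALSE, TRUE, TRUE),
  check.names = FALSE         # keep the spaces in the column names
)

# The mean of a logical vector is the proportion of TRUE values,
# so this gives the share of rows with video evidence in each year
video_share <- tapply(toy$`Complaint Has Video Evidence`,
                      toy$`Received Year`, mean)

print(round(video_share * 100, 1))
```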

5) Complaints Received vs Closing Year

This scatter plot shows the relationship between the year a complaint was received and the year it was closed. We see that most complaints are closed quickly.

scatter = ggplot(ccrb, aes(`Received Year`, `Close Year`))

scatter + geom_point() +
          geom_smooth(method = 'lm', color = 'red') +
          xlab('Complaints Received Year') + 
          ylab('Complaints Closed Year') + 
          ggtitle('Complaints Receiving and Closing Year')
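
With only whole years on both axes, many points overlap in the scatter plot, so it is hard to read off how many complaints take how long. Tabulating the difference between the two years is one way to check the "closed quickly" reading directly. The sketch below uses an invented data frame (`toy` is hypothetical):

```r
# Hypothetical stand-in for ccrb; the values are invented for illustration.
toy <- data.frame(
  `Received Year` = c(2005, 2004, 2004, 2006, 2006),
  `Close Year`    = c(2006, 2006, 2006, 2006, 2008),
  check.names = FALSE         # keep the spaces in the column names
)

# Years between receiving and closing each row's complaint
delay <- toy$`Close Year` - toy$`Received Year`

# How many rows fall into each delay, and the typical delay
delay_counts <- table(delay)
print(delay_counts)
print(median(delay))
```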

6) Complaints by incident location

This graph tells us where the incidents are occurring. Streets are by far the most common location.

ggplot(ccrb, aes(`Incident Location`)) +
  geom_bar() +  # one bar per location
  labs(title = "Number of Complaints by Incident Location", x = "Incident Location", y = "Total") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))  # rotate the long location labels

7) Encounter Outcomes by Borough

This graph tries to shed light on how encounter outcomes vary by borough.

ggplot(ccrb,aes(x = `Encounter Outcome`, 
             fill = `Borough of Occurrence`)) +
  geom_bar(stat = "count") +
  labs(title = "Encounter Outcomes by borough")

8) Outcomes by Year

This graph tries to show if the trend of outcomes has changed over time.

ggplot(ccrb, aes(x = `Received Year`, fill = `Encounter Outcome`)) +
  geom_bar() +
  labs(title = "Outcomes Each Year", x = "Received Year", y = "Number of Complaints") +
  scale_fill_discrete(name = "Outcome Type")

9) Outcome vs Full Investigation

This is a very useful graph for comparing encounter outcomes, arrests in particular, between complaints that did and did not receive a full investigation.

ggplot(ccrb, aes(x = `Is Full Investigation`, fill = `Encounter Outcome`)) + 
 geom_bar(stat = 'count') + 
 labs(title='Investigation vs Outcome') + 
 scale_fill_discrete(name = 'Encounter Outcome') 
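
The two bars have different heights, so comparing arrest rates between them is easier with row-wise proportions than with raw counts. A minimal sketch using prop.table() on an invented data frame (`toy` is hypothetical, and the "Summons" label is made up for the example):

```r
# Hypothetical stand-in for ccrb; the values are invented for illustration.
toy <- data.frame(
  `Is Full Investigation` = c(TRUE, TRUE, TRUE, TRUE,
                              FALSE, FALSE, FALSE, FALSE),
  `Encounter Outcome` = c("Arrest", "Arrest", "Arrest", "Summons",
                          "Arrest", "No Arrest or Summons",
                          "No Arrest or Summons", "Summons"),
  check.names = FALSE,        # keep the spaces in the column names
  stringsAsFactors = FALSE
)

# Cross-tabulate investigation status (rows) against outcome (columns)
counts <- table(toy$`Is Full Investigation`, toy$`Encounter Outcome`)

# margin = 1 divides each row by its own total, so every row sums to 1
row_share <- prop.table(counts, margin = 1)

print(round(row_share, 2))
```

In ggplot2, the analogous plot can be drawn with geom_bar(position = "fill"), which rescales every bar to the same height.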

10) Encounter outcome by %

The pie chart shows the share of incidents ending in each encounter outcome, chiefly arrest versus no arrest.

encounter <- table(ccrb$`Encounter Outcome`)
percent <- round(encounter / sum(encounter) * 100)
# Label each slice with its outcome name and percentage
lbls <- paste0(names(encounter), " ", percent, "%")
pie(encounter, labels = lbls, main = "Encounter Outcome by %")

Summary

We are given the Civilian Complaint Review Board (CCRB) data for NYC. A quick glance shows that there are 204,397 records and 16 variables. Data is recorded across 7 borough categories, including NA. The exploratory data analysis has revealed a few things: 1) Brooklyn sees the most incidents. 2) Phone is by far the most common way to file a complaint. 3) 44% of incidents end in arrest, more than any other outcome. 4) Complaints that received a full investigation show a higher rate of arrest. 5) Streets and highways are the most common incident location.