Objectives

The objective of this assingment is to conduct an exploratory data analysis of a data set that you are not familiar with. In this weeks lecture we discussed a number of visualiation approaches to exploring a data set, this assignment will apply those tools and techniques. An important distinction between class examples and applied data science work is that interative and repetitive nature of exploring a data set. It takes time and understand what is is the data and what is interesting in the data.

For this week we will be exploring data from the NYC Data Transparnecy Initiative. They maintain a database of complaints that fall within the Civilian Complain Review Board (CCRB), an independent municiple agency. Your objective is to identify interesting patterns and trends within the data that may be indicative of large scale trends.

This link will allow you to download the data set in .xlsx format. The data file has two tabs: one with metadata, and the “Complaints_Allegations” tab with the actual data.

Deliverable and Grades

For this assignment you should submit and a rpubs link to a knitr rendered html document that shows your exploratory data analysis. Organize your analysis using meaningful section headings:

library(readxl)
library(readr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(sqldf)
## Loading required package: gsubfn
## Loading required package: proto
## Loading required package: RSQLite
library(ggthemes)
library(forcats)
library(tidyverse)
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: purrr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag():    dplyr, stats
ccrb <- read_excel("C:/Users/khush/Downloads/ccrb_datatransparencyinitiative.xlsx", sheet = "Complaints_Allegations")
ccrb
## # A tibble: 204,397 x 16
##     DateStamp UniqueComplaintId Close.Year Received.Year
##        <dttm>             <dbl>      <dbl>         <dbl>
##  1 2016-11-29                11       2006          2005
##  2 2016-11-29                18       2006          2004
##  3 2016-11-29                18       2006          2004
##  4 2016-11-29                18       2006          2004
##  5 2016-11-29                18       2006          2004
##  6 2016-11-29                18       2006          2004
##  7 2016-11-29                18       2006          2004
##  8 2016-11-29                18       2006          2004
##  9 2016-11-29                18       2006          2004
## 10 2016-11-29                18       2006          2004
## # ... with 204,387 more rows, and 12 more variables:
## #   Borough.of.Occurrence <chr>, Is.Full.Investigation <lgl>,
## #   Complaint.Has.Video.Evidence <lgl>, Complaint.Filed.Mode <chr>,
## #   Complaint.Filed.Place <chr>,
## #   `Complaint.ContainsStop&FriskAllegations` <lgl>,
## #   Incident.Location <chr>, Incident.Year <dbl>, Encounter.Outcome <chr>,
## #   Reason.For.Initial.Contact <chr>, Allegation.FADO.Type <chr>,
## #   Allegation.Description <chr>
summary(ccrb)
##    DateStamp          UniqueComplaintId   Close.Year   Received.Year 
##  Min.   :2016-11-29   Min.   :    1     Min.   :2006   Min.   :1999  
##  1st Qu.:2016-11-29   1st Qu.:17356     1st Qu.:2008   1st Qu.:2007  
##  Median :2016-11-29   Median :34794     Median :2010   Median :2009  
##  Mean   :2016-11-29   Mean   :34778     Mean   :2010   Mean   :2010  
##  3rd Qu.:2016-11-29   3rd Qu.:52204     3rd Qu.:2013   3rd Qu.:2012  
##  Max.   :2016-11-29   Max.   :69492     Max.   :2016   Max.   :2016  
##  Borough.of.Occurrence Is.Full.Investigation Complaint.Has.Video.Evidence
##  Length:204397         Mode :logical         Mode :logical               
##  Class :character      FALSE:107084          FALSE:195530                
##  Mode  :character      TRUE :97313           TRUE :8867                  
##                                                                          
##                                                                          
##                                                                          
##  Complaint.Filed.Mode Complaint.Filed.Place
##  Length:204397        Length:204397        
##  Class :character     Class :character     
##  Mode  :character     Mode  :character     
##                                            
##                                            
##                                            
##  Complaint.ContainsStop&FriskAllegations Incident.Location  Incident.Year 
##  Mode :logical                           Length:204397      Min.   :1999  
##  FALSE:119856                            Class :character   1st Qu.:2007  
##  TRUE :84541                             Mode  :character   Median :2009  
##                                                             Mean   :2010  
##                                                             3rd Qu.:2012  
##                                                             Max.   :2016  
##  Encounter.Outcome  Reason.For.Initial.Contact Allegation.FADO.Type
##  Length:204397      Length:204397              Length:204397       
##  Class :character   Class :character           Class :character    
##  Mode  :character   Mode  :character           Mode  :character    
##                                                                    
##                                                                    
##                                                                    
##  Allegation.Description
##  Length:204397         
##  Class :character      
##  Mode  :character      
##                        
##                        
## 

View Data

ccrb %>% count(Received.Year)
## # A tibble: 17 x 2
##    Received.Year     n
##            <dbl> <int>
##  1          1999    13
##  2          2000     6
##  3          2002    15
##  4          2003    25
##  5          2004   864
##  6          2005 11935
##  7          2006 23864
##  8          2007 24384
##  9          2008 22191
## 10          2009 21365
## 11          2010 17817
## 12          2011 16454
## 13          2012 15683
## 14          2013 14799
## 15          2014 13658
## 16          2015 12680
## 17          2016  8644
ccrb %>% count(Borough.of.Occurrence)
## # A tibble: 7 x 2
##   Borough.of.Occurrence     n
##                   <chr> <int>
## 1                 Bronx 49442
## 2              Brooklyn 72215
## 3             Manhattan 42104
## 4                    NA   483
## 5           Outside NYC   170
## 6                Queens 30883
## 7         Staten Island  9100

1st Visualization

Visualising number of complaints received each year.

complaint1<-sqldf('select distinct UniqueComplaintId,"Received.Year" as Year from ccrb')

ggplot(complaint1) +
  geom_line(aes(Year, group = 1),
            stat = 'count',
            color = 'darkblue',
            size = 1.5) +
  theme_economist() +
  labs(title = 'Fig.1: Complaints Received in Each Year',
       x = 'Year',
       y = 'Complaints Received')

As we can observe from the above graph that number of complaints gradually reduced.

2nd Visualization

we need to observe the number of incident received per year on the basis of borough.

ggplot(ccrb, aes(x=ccrb$Received.Year, fill= ccrb$Borough.of.Occurrence)) + geom_bar(stat = "count") + labs (title = "Borough of Occurrence by Received Year", x="Year", y="Number") + scale_fill_discrete(name = "Borough of Occurrence")

From the above graph we can say that most of the incident was received against brooklyn borough.

3rd Visualization

Now we need to analyse that where do people mostly complaint regarding incident.

ggplot(ccrb, aes(x=ccrb$Received.Year, fill= ccrb$Complaint.Filed.Place)) + geom_bar(stat = "count") + labs (title = "Received Year by Complaint Filed Place", x="Received Year", y="Number") + scale_fill_discrete(name = "Complaint Filed Place")
## Warning: position_stack requires non-overlapping x intervals

We can observe that most of the complaints were filled in CCRB.

4th Visualization

ggplot(ccrb, aes(x=ccrb$Borough.of.Occurrence, fill= ccrb$Complaint.Filed.Place)) + geom_bar(stat = "count") + labs (title = "Borough of Occurrence by Complaint Filed Place", x="Borough of Occurrence", y="Number") + scale_fill_discrete(name = "Complaint Filed Place")

5th Visualization

ggplot(ccrb,aes(x=ccrb$Received.Year, fill=ccrb$Allegation.FADO.Type)) + geom_bar(stat = "count") + labs (title = "Allegation FADO Type Each Year", x="Year", y="Complaints")+ scale_fill_discrete(name = "Allegation Fado Type")+scale_x_continuous(breaks = seq(1999,2016,2))

We can see that most of the allegation fado type was “Abuse of authority” but with time the ratio came down

6th Visualization

Nextly I would like to see the location where the incident occured and the fado type

ggplot(ccrb, aes(x=ccrb$Incident.Location, fill = ccrb$Allegation.FADO.Type ))+geom_bar(stat="count")+ labs (title = "Number of Incident Occurred by FADO Type", x="Incident Location", y="Number Of Incidents") + theme (legend.position = "bottom") + scale_fill_discrete(name = "FADO Type")+scale_y_continuous(breaks = seq(0,125000,4000))+theme(axis.text.x = element_text(angle = 90, hjust = 1))+coord_flip()

Above graph shows that most of the incident occurred in street/highway and apartment where the FADO type is “Abuse of authority”

7th Visualization

Now,let’s analyse that whether the complaint was filled at the same time when the incident occurred.

fit1 <- lm(ccrb$Incident.Year~ccrb$Received.Year)
plot(ccrb$Incident.Year,ccrb$Received.Year, main = "Incident Year & Received Year", xlab = "Incident Year", ylab = "Received Year", pch = 20)
abline(fit1,col="blue")

We can see that there are two outlier which means that the complaint for this incident was filled after a long time.

8th Visualization

This visualization will help us to know how long it took to close the complaint

ggplot(ccrb,aes(ccrb$Received.Year,ccrb$Close.Year))+geom_point() + geom_smooth(method = lm) + labs (title = "Received Year Vs Close Year", y="Close Year", x="Received Year")+scale_x_continuous(breaks = seq(1999,2016,2))+scale_y_continuous(breaks = seq(1999,2016,2))+theme(panel.grid.minor = element_line(colour="white", size=0.5)) 

9th Visualization

It is also important to know the the mode of filling the complaint.

ggplot(ccrb) +
  geom_bar(aes(Received.Year, fill = Complaint.Filed.Mode),
            stat = 'count') +
 scale_fill_discrete() +
  labs(title = 'Fig.3: Complaints Received by Filed Mode in Each Year',
       x = 'Year',
       y = 'Complaints Received')

We can observe that most of the complaints were filled via phone.

10th Visualization

Now would like to analyse that whether each complaint has video as an evidence.

ggplot(ccrb, aes(x=ccrb$UniqueComplaintId, fill= ccrb$Complaint.Has.Video.Evidence)) + geom_bar(stat = "count") + labs (title = "UniqueComplaintId by Complaint Has Video Evidence", x="UniqueComplaintId", y="Number") + scale_fill_discrete(name = "Has Video Evidence")

Above graph states that not most of the graph has video as an evidence.

11th Visualization

Now we need to analyse that whther complaint was fully investigated and does it have any video evidence.

ggplot(ccrb, aes(x=ccrb$Is.Full.Investigation, fill= ccrb$Complaint.Has.Video.Evidence)) + geom_bar(stat = "count") + labs (title = "Investigation by Evidence", x="Is Full Investigation", y="Number") + scale_fill_discrete(name = "Has Video Evidence")

From the above graph we can observe that Most of the complaints which were fully investigated doesn’t have video evidence but we can also observe that there were some cases which had video evidence which might have helped to complete the investigation. The complaints which were not fully investigated had very few number of video evidence

12th Visualization

Now I would like to analyse number of complaints were fully investigated each year.

ggplot(ccrb, aes(x=ccrb$Close.Year, fill= ccrb$Is.Full.Investigation)) + geom_bar(stat = "count") + labs (title = "Number of Case Closed Each Year by Investigation", x="Close Year", y="Number") +  scale_fill_discrete(name = "Is Full Investigation")

13th Visualization

At last I would like to observe what was the outcome after the case was closed.

ggplot(ccrb, aes(x=ccrb$Close.Year, fill= ccrb$Encounter.Outcome)) + geom_bar(stat = "count") + labs (title = "Close Year by Outcome", x="Close Year", y="Number") + theme (legend.position = "bottom") + scale_fill_discrete(name = "Encountered Outcome")

Summary

-Number of complaints gradually reduced after 2009. -People prefer to file complaint in CCRB but some incidents were reported after few months/years from the date of incident -Most of the incident was received against brooklyn borough. -Most of the allegation fado type was “Abuse of authority”. -Most of the incident occurred in street/highway and apartment where the FADO type is “Abuse of authority”.