Exploratory Data Analysis of NYC Data Transparency Initiative data set

Data Reading

# Data
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.3.3
library(readxl)
## Warning: package 'readxl' was built under R version 3.3.3
ccrb <- read_excel("E:\\Harrisburg courses\\sem 6-Fall 2017\\Data Visualization\\lecture5\\ccrb_datatransparencyinitiative.xlsx", sheet = 'Complaints_Allegations') 
head(ccrb)
## # A tibble: 6 x 16
##    DateStamp UniqueComplaintId `Close Year` `Received Year`
##       <dttm>             <dbl>        <dbl>           <dbl>
## 1 2016-11-29                11         2006            2005
## 2 2016-11-29                18         2006            2004
## 3 2016-11-29                18         2006            2004
## 4 2016-11-29                18         2006            2004
## 5 2016-11-29                18         2006            2004
## 6 2016-11-29                18         2006            2004
## # ... with 12 more variables: `Borough of Occurrence` <chr>, `Is Full
## #   Investigation` <lgl>, `Complaint Has Video Evidence` <lgl>, `Complaint
## #   Filed Mode` <chr>, `Complaint Filed Place` <chr>, `Complaint Contains
## #   Stop & Frisk Allegations` <lgl>, `Incident Location` <chr>, `Incident
## #   Year` <dbl>, `Encounter Outcome` <chr>, `Reason For Initial
## #   Contact` <chr>, `Allegation FADO Type` <chr>, `Allegation
## #   Description` <chr>
summary(ccrb)
##    DateStamp          UniqueComplaintId   Close Year   Received Year 
##  Min.   :2016-11-29   Min.   :    1     Min.   :2006   Min.   :1999  
##  1st Qu.:2016-11-29   1st Qu.:17356     1st Qu.:2008   1st Qu.:2007  
##  Median :2016-11-29   Median :34794     Median :2010   Median :2009  
##  Mean   :2016-11-29   Mean   :34778     Mean   :2010   Mean   :2010  
##  3rd Qu.:2016-11-29   3rd Qu.:52204     3rd Qu.:2013   3rd Qu.:2012  
##  Max.   :2016-11-29   Max.   :69492     Max.   :2016   Max.   :2016  
##  Borough of Occurrence Is Full Investigation Complaint Has Video Evidence
##  Length:204397         Mode :logical         Mode :logical               
##  Class :character      FALSE:107084          FALSE:195530                
##  Mode  :character      TRUE :97313           TRUE :8867                  
##                        NA's :0               NA's :0                     
##                                                                          
##                                                                          
##  Complaint Filed Mode Complaint Filed Place
##  Length:204397        Length:204397        
##  Class :character     Class :character     
##  Mode  :character     Mode  :character     
##                                            
##                                            
##                                            
##  Complaint Contains Stop & Frisk Allegations Incident Location 
##  Mode :logical                               Length:204397     
##  FALSE:119856                                Class :character  
##  TRUE :84541                                 Mode  :character  
##  NA's :0                                                       
##                                                                
##                                                                
##  Incident Year  Encounter Outcome  Reason For Initial Contact
##  Min.   :1999   Length:204397      Length:204397             
##  1st Qu.:2007   Class :character   Class :character          
##  Median :2009   Mode  :character   Mode  :character          
##  Mean   :2010                                                
##  3rd Qu.:2012                                                
##  Max.   :2016                                                
##  Allegation FADO Type Allegation Description
##  Length:204397        Length:204397         
##  Class :character     Class :character      
##  Mode  :character     Mode  :character      
##                                             
##                                             
## 
dim(ccrb)
## [1] 204397     16
library(ggplot2)
library(plotrix)
## Warning: package 'plotrix' was built under R version 3.3.3
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 3.3.3
library(ggthemes)
## Warning: package 'ggthemes' was built under R version 3.3.3
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.3.3
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:gridExtra':
## 
##     combine
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# ccrb <- na.omit(ccrb)
# dim(ccrb)
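
The head(ccrb) output above repeats UniqueComplaintId 18 on several rows, which suggests that each row is one allegation rather than one complaint. The following is a minimal base R sketch of how rows and distinct complaints can be compared. It runs on a small made-up data frame called toy (a hypothetical stand-in, not the real ccrb data), so the numbers are illustrative only.

```r
# Toy stand-in for `ccrb`; the values are made up and only the column
# name matches the real data set.
toy <- data.frame(
  UniqueComplaintId = c(11, 18, 18, 18, 18, 18, 25),
  check.names = FALSE
)

# Number of rows (one row per allegation in this toy example)
n_rows <- nrow(toy)

# Number of distinct complaints
n_complaints <- length(unique(toy$UniqueComplaintId))

# Allegation rows per complaint
allegations_per_complaint <- table(toy$UniqueComplaintId)

n_rows
n_complaints
allegations_per_complaint
```

If the same pattern holds in the full data, plots that count rows describe allegations, while plots built on distinct UniqueComplaintId values describe complaints.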

Visualization 1: Observing Trend for number of complaints over years

# Data
vis1 <- ccrb %>% group_by(`Incident Year`) %>% summarize(num_case = n_distinct(`UniqueComplaintId`)) %>% select(`Incident Year`, num_case)

ggplot(data = vis1, aes(x = `Incident Year`, y = num_case)) + geom_line() + ggtitle('Number of Complaints vs Incident Year') + xlab('Incident Year') + ylab('Number of Incidents') + theme_minimal()

The line plot above shows the trend in the number of complaints by incident year. The number of complaints has decreased since 2008.
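
As a rough illustration of what the n_distinct() step in vis1 computes, the base R sketch below counts distinct complaint IDs per year and compares them with raw row counts. The toy data frame and its values are assumptions made for this example and are not taken from the ccrb file.

```r
# Made-up data: complaint 1 and complaint 3 each have several rows.
toy <- data.frame(
  `Incident Year` = c(2005, 2005, 2005, 2006, 2006, 2007),
  UniqueComplaintId = c(1, 1, 2, 3, 3, 4),
  check.names = FALSE
)

# Distinct complaints per incident year (the quantity plotted in vis1)
complaints_per_year <- tapply(
  toy$UniqueComplaintId,
  toy$`Incident Year`,
  function(ids) length(unique(ids))
)

# Raw rows per incident year, for comparison
rows_per_year <- table(toy$`Incident Year`)

complaints_per_year
rows_per_year
```

The two counts differ whenever a complaint spans more than one row, which is why the line plot uses distinct IDs.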

Visualization 2: Understanding the relation between Incident location and Allegation Type

# Data

ggplot(ccrb, aes(x=`Incident Location`, fill=`Allegation FADO Type`)) + 
  geom_histogram(stat = "count") + 
  labs(title="The FADO type of the allegation by Incident Location", x="Incident Location", y="Number of incidents") + 
  scale_fill_discrete(name="Allegation Type") + 
  theme(legend.position = "bottom") + coord_flip()
## Warning: Ignoring unknown parameters: binwidth, bins, pad

The graph above shows how many incidents of each allegation type occurred at each incident location. Street/highway has more incidents than any other location, followed by apartment/house. At both of these locations, most incidents are of the Abuse of Authority type.
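
The counts behind a stacked bar chart like this one can also be read from a cross-tabulation. The sketch below uses a made-up toy data frame (the rows and the "Other" location label are assumptions for illustration) to show one possible way to rank locations and find the most frequent allegation type at each.

```r
# Made-up rows; only the column names mirror the real data set.
toy <- data.frame(
  `Incident Location` = c("Street/highway", "Street/highway", "Street/highway",
                          "Street/highway", "Apartment/house", "Apartment/house",
                          "Apartment/house", "Other"),
  `Allegation FADO Type` = c("Abuse of Authority", "Abuse of Authority", "Force",
                             "Discourtesy", "Abuse of Authority",
                             "Abuse of Authority", "Force", "Force"),
  check.names = FALSE
)

# Location x allegation type counts (the heights of the stacked segments)
counts <- table(toy$`Incident Location`, toy$`Allegation FADO Type`)

# Total rows per location, largest first
location_totals <- sort(rowSums(counts), decreasing = TRUE)

# Most frequent allegation type at each location
top_type <- colnames(counts)[apply(counts, 1, which.max)]
names(top_type) <- rownames(counts)

location_totals
top_type
```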

Visualization 3: Number of Incident Occurrences in Borough by year

ggplot(ccrb, aes(`Incident Year`, colour =`Borough of Occurrence`)) + geom_freqpoly(binwidth=1, na.rm = FALSE) + ggtitle("Incident Occurrence in Borough by year") + labs(x="Incident Year", y="Number of incidents")

Visualization 3 shows the number of incidents by year for each borough. Brooklyn has the highest number of incidents, but the counts have dropped since 2007 for Brooklyn, the Bronx and Manhattan, while the counts for Outside NYC and Staten Island have remained almost the same.

Visualization 4: Medium for Complaints Filing in each borough and over years

plot11<-ggplot(ccrb, aes(x=`Borough of Occurrence`, fill=`Complaint Filed Mode`)) + 
  geom_histogram(stat = "count") + 
  labs(title="Complaint filed mode by borough", x="Borough", y="Number of incidents") + 
  scale_fill_discrete(name="Complaint Filed Mode") + 
  theme(legend.position = "bottom") 
## Warning: Ignoring unknown parameters: binwidth, bins, pad
plot12<-ggplot(ccrb, aes(x=`Received Year`, fill=`Complaint Filed Mode`)) + 
  geom_histogram(stat = "count") + 
  labs(title="Complaint filed mode by received year", x="Received Year", y="Number of incidents") + 
  scale_fill_discrete(name="Complaint Filed Mode") + 
  theme(legend.position = "bottom")
## Warning: Ignoring unknown parameters: binwidth, bins, pad
grid.arrange(plot11, plot12, ncol=2)

The plots above show how complaints are filed in each borough and in each year. Phone is the most frequently used mode for filing complaints in every borough. The number of complaints was highest in 2007 and has been declining since 2008.

Visualization 5: Relationship between Received Year and Incident Year

ggplot(ccrb, aes(x=`Received Year`, y=`Incident Year`)) + 
  geom_point(color= "red") + 
  labs(title="Relationship between Received Year and Incident Year", x="Received Year", y="Incident Year")

Visualization 5 shows whether people filed complaints promptly. The scatter plot shows that most complaints are received in the same year as the incident or within 1 or 2 years of it. Two extreme outliers indicate complaints filed 5 to 10 years after the incident occurred.
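
Because many points in a scatter plot of two year columns overlap, a tally of the reporting delay can make the "most complaints" statement easier to check. The sketch below computes the delay as Received Year minus Incident Year on a made-up toy data frame; the variable names delay_years, delay_counts and share_same_year are introduced here for illustration.

```r
# Made-up rows; the last one mimics a complaint filed long after the incident.
toy <- data.frame(
  `Incident Year` = c(2005, 2005, 2006, 2006, 2007, 1999),
  `Received Year` = c(2005, 2005, 2006, 2007, 2007, 2009),
  check.names = FALSE
)

# Years between the incident and the receipt of the complaint
delay_years <- toy$`Received Year` - toy$`Incident Year`

# How many rows fall at each delay
delay_counts <- table(delay_years)

# Share of rows received in the same year as the incident
share_same_year <- mean(delay_years == 0)

delay_counts
share_same_year
```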

Visualization 6: Number of complaints filed at different locations

p <- ggplot(ccrb, aes(`Complaint Filed Place`, `Received Year`))
p + geom_boxplot() + coord_flip() +  ggtitle("Complaints by filed place")

The box plot above shows, for each place where complaints are filed, the distribution of received years in terms of quartiles. Most of the complaints are filed at CCRB and ISB, whereas the other places have fewer complaints. Note that a box plot summarizes the spread of received years rather than the number of complaints, so the comparison of volumes rests on the number of records per place, not on the size of the boxes.
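
To keep the two ideas apart, the sketch below computes both the number of rows per filing place and the quartiles of Received Year per filing place. It uses a made-up toy data frame; the place labels are taken from the text above, but the values and the "Other" category are assumptions for illustration.

```r
# Made-up rows; only the column names mirror the real data set.
toy <- data.frame(
  `Complaint Filed Place` = c("CCRB", "CCRB", "CCRB", "CCRB", "ISB", "ISB", "Other"),
  `Received Year` = c(2006, 2008, 2010, 2012, 2007, 2009, 2011),
  check.names = FALSE
)

# Volume: rows per filing place, largest first
complaints_by_place <- sort(table(toy$`Complaint Filed Place`), decreasing = TRUE)

# Distribution: quartiles of received year per filing place
# (this is the information a box plot draws)
quartiles_by_place <- sapply(
  split(toy$`Received Year`, toy$`Complaint Filed Place`),
  quantile,
  probs = c(0.25, 0.5, 0.75)
)

complaints_by_place
quartiles_by_place
```

A bar chart of the first table would answer "where are most complaints filed", while the box plot answers "when were complaints received at each place".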

Visualization 7: Proportions of Complaint Outcome

vis6 <- table(ccrb$`Encounter Outcome`)
lbl <- c('Arrest', 'No Arrest or Summons', 'Other/NA', 'Summons')
pie3D(vis6, labels = lbl, radius = 1, col = c('magenta','red','yellow','skyblue'), main= 'Incident Outcome')

This pie chart shows the encounter outcomes and their proportions. The most common outcome appears to be arrest, followed by no arrest or summons. The next visualization explores whether there is any relationship between outcome, video evidence and full investigation.

Visualization 8: Relationship between Full investigation and encounter outcome with and without video Evidence

# Split on the logical video-evidence flag (summary() above shows no NAs in it)
ccrb1 <- ccrb[ccrb$`Complaint Has Video Evidence`, ]
ccrb2 <- ccrb[!ccrb$`Complaint Has Video Evidence`, ]

plot1 <- ggplot(ccrb1, aes(x=`Is Full Investigation`, fill=`Encounter Outcome`)) + 
  geom_histogram(stat = "count") + 
  labs(title="Relationship between Full investigation and encounter outcome with video Evidence", x="Full Investigation", y="Number of complaints") + 
  theme(legend.position = "bottom")
## Warning: Ignoring unknown parameters: binwidth, bins, pad
plot2 <- ggplot(ccrb2, aes(x=`Is Full Investigation`, fill=`Encounter Outcome`)) + 
  geom_histogram(stat = "count") + 
  labs(title="Relationship between Full investigation and encounter outcome with no video evidence", x="Full Investigation", y="Number of complaints") + 
  theme(legend.position = "bottom") 
## Warning: Ignoring unknown parameters: binwidth, bins, pad
grid.arrange(plot1, plot2, ncol=2)

Visualization 8 helps us understand whether there is any relationship between encounter outcome, full investigation and video evidence. Most outcomes are arrests when there is a full investigation. With video evidence, full investigations outnumber complaints without a full investigation; without video evidence, complaints without a full investigation are somewhat more numerous than full investigations. It seems that video evidence does help investigations.
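
Since far fewer rows have video evidence than not (8867 versus 195530 in the summary above), comparing shares within each group can be clearer than comparing raw bar heights across the two panels. The sketch below shows one way to compute the share of full investigations with and without video evidence. The toy data frame is made up, so the resulting shares say nothing about the real data.

```r
# Made-up rows; only the column names mirror the real data set.
toy <- data.frame(
  `Complaint Has Video Evidence` = c(TRUE, TRUE, TRUE, TRUE,
                                     FALSE, FALSE, FALSE, FALSE, FALSE, FALSE),
  `Is Full Investigation` = c(TRUE, TRUE, TRUE, FALSE,
                              TRUE, TRUE, FALSE, FALSE, FALSE, FALSE),
  check.names = FALSE
)

# Counts of video evidence (rows) by full investigation (columns)
counts <- table(
  video = toy$`Complaint Has Video Evidence`,
  full = toy$`Is Full Investigation`
)

# Row-wise proportions: share of full investigations within each video group
share_full <- prop.table(counts, margin = 1)

counts
share_full
```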

Visualization 9: Impact of allegation type on encounter outcome

ggplot(ccrb, aes(x=`Allegation FADO Type`, fill=`Encounter Outcome`)) + 
  geom_histogram(stat = "count") + 
  labs(title="Relationship between Allegation Type and encounter outcome", x="Allegation FADO Type", y="Number of complaints") + 
  theme(legend.position = "bottom")+  coord_flip()
## Warning: Ignoring unknown parameters: binwidth, bins, pad

The plot above shows the encounter outcomes for each allegation type. Most arrests are associated with Force allegations, followed by Abuse of Authority. Most "no arrest or summons" outcomes are associated with Abuse of Authority, followed by Force. Abuse of Authority is the most frequent allegation type, followed by Force.

Visualization 10: Relationship between Received Year and Close Year

ggplot(ccrb, aes(x=`Received Year`, y= `Close Year`)) + 
  geom_point(color='magenta') + 
  geom_smooth(method = lm) +
  labs(title="Relationship between Received Year and Close Year", x="Received Year", y="Close Year")

This visualization shows how long it takes to close a complaint. The scatter plot shows that most complaints are closed within 1 to 2 years after they are filed.
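
The "within 1 to 2 years" reading can be expressed as a number by taking the difference between Close Year and Received Year. The sketch below is a minimal illustration on a made-up toy data frame; years_to_close, cumulative_share and share_within_2 are names introduced for this example only.

```r
# Made-up rows; only the column names mirror the real data set.
toy <- data.frame(
  `Received Year` = c(2005, 2005, 2006, 2007, 2008, 2008),
  `Close Year` = c(2006, 2006, 2006, 2008, 2009, 2011),
  check.names = FALSE
)

# Years between receiving and closing a complaint
years_to_close <- toy$`Close Year` - toy$`Received Year`

# Cumulative share of rows closed within 0, 1, 2, ... years
cumulative_share <- cumsum(prop.table(table(years_to_close)))

# Share of rows closed within 2 years of being received
share_within_2 <- mean(years_to_close <= 2)

cumulative_share
share_within_2
```

Because the data records years only, a difference of 1 can mean anything from a few days to almost two calendar years, so these figures are coarse.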

Summary

Discovering trends and patterns in the data is the most important step in data analysis. Exploratory Data Analysis (EDA) and EDA tools help people understand the relationships, correlations and structure of a data set. The two main techniques in EDA are graphical summaries (data visualization) and numerical summaries. In the 10 visualizations above, we examined the relationships between different variables and identified patterns and trends. Different types of visualizations, such as histograms, bar charts, pie charts, line charts and box plots, were used. Understanding these visualizations can help find ways to improve the work efficiency of the NYC agency or reduce the number of incidents. The analysis helps in understanding complaints, their frequency and the significance of video evidence for investigations.

From the EDA above, one can conclude that the street is the most unsafe place and has the highest number of complaints. The government can take more security precautions in order to reduce the number of complaints. Most complaints are filed without video evidence. The largest number of incidents occurred in Brooklyn. Abuse of Authority is the most frequent allegation. Phone is the most popular medium for filing complaints. CCRB is the most popular place to file complaints. Video evidence helps in performing full investigations. The agency can allot more resources to this medium to improve efficiency. We also found that the response time to complaints is fairly efficient, but there is still a need to improve the efficiency of investigations. These findings help to understand the data in more depth and reveal new insights about it.