# Data
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.3.3
library(readxl)
## Warning: package 'readxl' was built under R version 3.3.3
ccrb <- read_excel("E:\\Harrisburg courses\\sem 6-Fall 2017\\Data Visualization\\lecture5\\ccrb_datatransparencyinitiative.xlsx", sheet = 'Complaints_Allegations')
head(ccrb)
## # A tibble: 6 x 16
## DateStamp UniqueComplaintId `Close Year` `Received Year`
## <dttm> <dbl> <dbl> <dbl>
## 1 2016-11-29 11 2006 2005
## 2 2016-11-29 18 2006 2004
## 3 2016-11-29 18 2006 2004
## 4 2016-11-29 18 2006 2004
## 5 2016-11-29 18 2006 2004
## 6 2016-11-29 18 2006 2004
## # ... with 12 more variables: `Borough of Occurrence` <chr>, `Is Full
## # Investigation` <lgl>, `Complaint Has Video Evidence` <lgl>, `Complaint
## # Filed Mode` <chr>, `Complaint Filed Place` <chr>, `Complaint Contains
## # Stop & Frisk Allegations` <lgl>, `Incident Location` <chr>, `Incident
## # Year` <dbl>, `Encounter Outcome` <chr>, `Reason For Initial
## # Contact` <chr>, `Allegation FADO Type` <chr>, `Allegation
## # Description` <chr>
summary(ccrb)
## DateStamp UniqueComplaintId Close Year Received Year
## Min. :2016-11-29 Min. : 1 Min. :2006 Min. :1999
## 1st Qu.:2016-11-29 1st Qu.:17356 1st Qu.:2008 1st Qu.:2007
## Median :2016-11-29 Median :34794 Median :2010 Median :2009
## Mean :2016-11-29 Mean :34778 Mean :2010 Mean :2010
## 3rd Qu.:2016-11-29 3rd Qu.:52204 3rd Qu.:2013 3rd Qu.:2012
## Max. :2016-11-29 Max. :69492 Max. :2016 Max. :2016
## Borough of Occurrence Is Full Investigation Complaint Has Video Evidence
## Length:204397 Mode :logical Mode :logical
## Class :character FALSE:107084 FALSE:195530
## Mode :character TRUE :97313 TRUE :8867
## NA's :0 NA's :0
##
##
## Complaint Filed Mode Complaint Filed Place
## Length:204397 Length:204397
## Class :character Class :character
## Mode :character Mode :character
##
##
##
## Complaint Contains Stop & Frisk Allegations Incident Location
## Mode :logical Length:204397
## FALSE:119856 Class :character
## TRUE :84541 Mode :character
## NA's :0
##
##
## Incident Year Encounter Outcome Reason For Initial Contact
## Min. :1999 Length:204397 Length:204397
## 1st Qu.:2007 Class :character Class :character
## Median :2009 Mode :character Mode :character
## Mean :2010
## 3rd Qu.:2012
## Max. :2016
## Allegation FADO Type Allegation Description
## Length:204397 Length:204397
## Class :character Class :character
## Mode :character Mode :character
##
##
##
dim(ccrb)
## [1] 204397 16
library(ggplot2)
library(plotrix)
## Warning: package 'plotrix' was built under R version 3.3.3
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 3.3.3
library(ggthemes)
## Warning: package 'ggthemes' was built under R version 3.3.3
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.3.3
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:gridExtra':
##
## combine
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# ccrb <- na.omit(ccrb)
# dim(ccrb)
vis1 <- ccrb %>% group_by(`Incident Year`) %>% summarize(num_case = n_distinct(`UniqueComplaintId`)) %>% select(`Incident Year`, num_case)
ggplot(data = vis1, aes(x = `Incident Year`, y = num_case)) + geom_line() + ggtitle('Number of Complaints vs Incident Year') + xlab('Incident Year') + ylab('Number of Incidents') + theme_minimal()
The line plot above shows the trend in the number of complaints by incident year. The number of complaints has decreased since 2008.
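The `head()` output shows `UniqueComplaintId` 18 on five rows, and `dim()` reports 204397 rows while the largest ID is 69492, so the sheet appears to hold one row per allegation rather than one per complaint. That is presumably why the summary above uses `n_distinct()`. The sketch below illustrates the difference on made-up rows (the values are hypothetical; only the column names come from the data):

```r
# Made-up rows shaped like the sheet: one row per allegation,
# so a complaint with several allegations appears several times.
toy <- data.frame(
  UniqueComplaintId = c(11, 18, 18, 18, 25, 25),
  `Incident Year`   = c(2005, 2004, 2004, 2004, 2005, 2005),
  check.names = FALSE
)

# Rows per year (allegations) versus distinct complaint IDs per year.
rows_per_year <- table(toy$`Incident Year`)
complaints_per_year <- tapply(
  toy$UniqueComplaintId, toy$`Incident Year`,
  function(ids) length(unique(ids))
)

rows_per_year
complaints_per_year
```

Plots that count rows directly, such as the bar charts in this report, are therefore counting allegations, which can overstate complaints with many allegations.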
ggplot(ccrb, aes(x=`Incident Location`, fill=`Allegation FADO Type`)) +
geom_histogram(stat = "count") +
labs(title="The FADO type of the allegation by Incident Location", x="Incident Location", y="Number of incidents") +
scale_fill_discrete(name="Allegation Type") +
theme(legend.position = "bottom") + coord_flip()
## Warning: Ignoring unknown parameters: binwidth, bins, pad
The graph above shows the number of incidents of each allegation type at each incident location. Street/highway has more incidents than any other location, followed by apartment/house. At both of these locations, most incidents are Abuse of Authority allegations.
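The warning "Ignoring unknown parameters: binwidth, bins, pad" comes from calling `geom_histogram()` with `stat = "count"` on a categorical variable. A bar chart of category counts can be written with `geom_bar()`, which counts rows by default. A minimal sketch on made-up rows (the values are illustrative, not taken from the data):

```r
library(ggplot2)

# Made-up rows; the column names match the sheet, the values are illustrative.
toy <- data.frame(
  `Incident Location`    = c("Street/highway", "Street/highway",
                             "Apartment/house", "Street/highway"),
  `Allegation FADO Type` = c("Abuse of Authority", "Force",
                             "Abuse of Authority", "Abuse of Authority"),
  check.names = FALSE
)

# geom_bar() uses stat = "count" by default, so no stat argument is needed.
p <- ggplot(toy, aes(x = `Incident Location`, fill = `Allegation FADO Type`)) +
  geom_bar() +
  labs(x = "Incident Location", y = "Number of allegations") +
  coord_flip()

# layer_data() exposes the counts that the bars are drawn from.
bar_counts <- layer_data(p)
sum(bar_counts$count)
```

The same substitution should apply to the other stacked bar charts in this report.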
ggplot(ccrb, aes(`Incident Year`, colour =`Borough of Occurrence`)) + geom_freqpoly(binwidth=1, na.rm = FALSE) + ggtitle("Incident Occurrence in Borough by Year") + labs(x="Incident Year", y="Number of incidents")
The plot above (visualization 3) shows the number of incidents by year and borough. Brooklyn has the highest number of incidents, but the counts have dropped since 2007 for Brooklyn, the Bronx, and Manhattan, while the counts for Staten Island and outside NYC have remained almost the same.
plot11<-ggplot(ccrb, aes(x=`Borough of Occurrence`, fill=`Complaint Filed Mode`)) +
geom_histogram(stat = "count") +
labs(title="Complaint Filed Mode by Borough", x="Borough", y="Number of incidents") +
scale_fill_discrete(name="Complaint Filed Mode") +
theme(legend.position = "bottom")
## Warning: Ignoring unknown parameters: binwidth, bins, pad
plot12<-ggplot(ccrb, aes(x=`Received Year`, fill=`Complaint Filed Mode`)) +
geom_histogram(stat = "count") +
labs(title="Complaint Filed Mode by Received Year", x="Received Year", y="Number of incidents") +
scale_fill_discrete(name="Complaint Filed Mode") +
theme(legend.position = "bottom")
## Warning: Ignoring unknown parameters: binwidth, bins, pad
grid.arrange(plot11, plot12, ncol=2)
The plots above show how complaints are filed in each borough and in each year. Phone is the most frequently used mode of filing in every borough. The number of complaints was highest in 2007 and has been declining since 2008.
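Stacked counts make it hard to compare boroughs of very different sizes. One way to check the claim that phone dominates in every borough is to convert the counts to within-borough shares. A minimal sketch with made-up rows (every filing mode other than phone is a hypothetical value):

```r
# Made-up rows; only the column names come from the sheet.
toy <- data.frame(
  `Borough of Occurrence` = c("Brooklyn", "Brooklyn", "Brooklyn", "Bronx", "Bronx"),
  `Complaint Filed Mode`  = c("Phone", "Phone", "Mail", "Phone", "In-person"),
  check.names = FALSE
)

# Cross-tabulate borough by filing mode, then divide each row by its total
# so every borough sums to 1.
counts <- table(toy$`Borough of Occurrence`, toy$`Complaint Filed Mode`)
shares <- prop.table(counts, margin = 1)

round(shares, 2)
```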
ggplot(ccrb, aes(x=`Received Year`, y=`Incident Year`)) +
geom_point(color= "red") +
labs(title="Relationship between Received Year and Incident Year", x="Received Year", y="Incident Year")
Visualization 5 shows whether people filed complaints immediately. The scatter plot above shows that most people report incidents in the same year or within 1 or 2 years of the incident. Two extreme outliers, however, indicate complaints filed 5 to 10 years after the incident occurred.
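Because both axes are whole years, many points in the scatter plot sit on top of each other. Tabulating the reporting lag directly gives the same information as numbers. A minimal sketch on made-up rows (the years are hypothetical, and the 5-year cutoff is an assumption chosen to match the outliers described above):

```r
# Made-up rows; only the column names come from the sheet.
toy <- data.frame(
  `Incident Year` = c(2005, 2004, 2006, 1999, 2010),
  `Received Year` = c(2005, 2005, 2006, 2008, 2010),
  check.names = FALSE
)

# Years between the incident and the receipt of the complaint.
lag_years <- toy$`Received Year` - toy$`Incident Year`

# How many rows fall at each lag.
lag_table <- table(lag_years)
lag_table

# Rows reported five or more years after the incident.
late <- toy[lag_years >= 5, ]
late
```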
p <- ggplot(ccrb, aes(`Complaint Filed Place`, `Received Year`))
p + geom_boxplot() + coord_flip() + ggtitle("Complaints by filed place")
The box plot above shows, for each place where complaints are filed, the distribution of received years in terms of quartiles. Most of the complaints are filed at CCRB and ISB, whereas other places receive fewer complaints. Note that a box plot summarizes the spread of years, not the number of complaints, so the volume at each place has to be read from counts.
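The number of rows per filing place can be tabulated directly. A minimal sketch on made-up rows ("CCRB" and "ISB" are the labels used in the text above; "Other" and all counts are hypothetical):

```r
# Made-up rows; only the column name comes from the sheet.
toy <- data.frame(
  `Complaint Filed Place` = c("CCRB", "CCRB", "ISB", "CCRB", "Other", "ISB"),
  check.names = FALSE
)

# Count rows per place and list the busiest places first.
place_counts <- sort(table(toy$`Complaint Filed Place`), decreasing = TRUE)
place_counts
```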
vis6 <- table(ccrb$`Encounter Outcome`)
lbl <- c('Arrest', 'No Arrest or Summons', 'Other/NA', 'Summons')
pie3D(vis6, labels = lbl, radius = 1, col = c('magenta','red','yellow','skyblue'), main= 'Incident Outcome')
This pie chart shows the outcomes of the encounters and their proportions. The most common outcome appears to be arrest, followed by no arrest or summons. The next visualization explores whether there is any relationship between encounter outcome, full investigation, and video evidence.
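The hard-coded `lbl` vector only matches the slices if its order happens to equal the alphabetical order that `table()` produces. Deriving the labels from the table keeps them aligned and lets the percentages be shown. A minimal sketch on made-up outcomes (the counts are hypothetical):

```r
# Made-up outcomes; the category names are taken from the labels above.
toy_outcome <- c("Arrest", "Arrest", "Arrest",
                 "No Arrest or Summons", "No Arrest or Summons",
                 "Summons")

counts <- table(toy_outcome)

# Percentage of rows in each category, rounded to whole numbers.
pct <- round(100 * prop.table(counts))

# Build one label per slice from the table's own names.
lbl <- paste0(names(counts), " (", pct, "%)")
lbl
```

A vector built this way can be passed as `labels = lbl` in the `pie3D()` call.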
# Split on the logical column directly; summary() above shows it has no NAs.
ccrb1 <- ccrb[ccrb$`Complaint Has Video Evidence`, ]
ccrb2 <- ccrb[!ccrb$`Complaint Has Video Evidence`, ]
plot1 <- ggplot(ccrb1, aes(x=`Is Full Investigation`, fill=`Encounter Outcome`)) +
geom_histogram(stat = "count") +
labs(title="Relationship between Full investigation and encounter outcome with video Evidence", x="Full Investigation", y="Number of complaints") +
theme(legend.position = "bottom")
## Warning: Ignoring unknown parameters: binwidth, bins, pad
plot2 <- ggplot(ccrb2, aes(x=`Is Full Investigation`, fill=`Encounter Outcome`)) +
geom_histogram(stat = "count") +
labs(title="Relationship between Full investigation and encounter outcome with no video evidence", x="Full Investigation", y="Number of complaints") +
theme(legend.position = "bottom")
## Warning: Ignoring unknown parameters: binwidth, bins, pad
grid.arrange(plot1, plot2, ncol=2)
Visualization 8 helps us to understand whether there is any relationship between encounter outcome, full investigation, and video evidence. Most outcomes are arrests when there is a full investigation. With video evidence, full investigations outnumber complaints without a full investigation; without video evidence, complaints without a full investigation somewhat outnumber full investigations. Video evidence therefore seems to help the investigation.
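The two panels have very different totals (8867 rows with video evidence against 195530 without, according to `summary()`), so bar heights are hard to compare across them. The share of full investigations within each group is a more direct comparison. A minimal sketch on made-up rows (the values are hypothetical):

```r
# Made-up rows; only the column names come from the sheet.
toy <- data.frame(
  `Complaint Has Video Evidence` = c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE),
  `Is Full Investigation`        = c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE),
  check.names = FALSE
)

# 2 x 2 table: rows are video evidence, columns are full investigation.
tab <- table(
  video = toy$`Complaint Has Video Evidence`,
  full  = toy$`Is Full Investigation`
)

# Share of fully investigated rows within each video-evidence group.
full_share <- prop.table(tab, margin = 1)[, "TRUE"]
round(full_share, 2)
```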
ggplot(ccrb, aes(x=`Allegation FADO Type`, fill=`Encounter Outcome`)) +
geom_histogram(stat = "count") +
labs(title="Relationship between Allegation Type and encounter outcome", x="Allegation FADO Type", y="Number of complaints") +
theme(legend.position = "bottom")+ coord_flip()
## Warning: Ignoring unknown parameters: binwidth, bins, pad
The plot above shows the outcomes for each allegation type. Most arrests occur for Force allegations, followed by Abuse of Authority, whereas the majority of "no arrest or summons" outcomes occur for Abuse of Authority, followed by Force. Abuse of Authority is the most frequent allegation type, followed by Force.
ggplot(ccrb, aes(x=`Received Year`, y= `Close Year`)) +
geom_point(color='magenta') +
geom_smooth(method = lm) +
labs(title="Relationship between Received Year and Close Year", x="Received Year", y="Close Year")
This visualization shows how long it takes to close a complaint. The scatter plot above shows that most complaints are closed within 1 to 2 years after the complaint is filed.
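The "within 1 to 2 years" reading can be checked by computing the cumulative share of rows closed within a given number of years. A minimal sketch on made-up rows (the years are hypothetical; calendar-year differences are only a coarse measure of duration):

```r
# Made-up rows; only the column names come from the sheet.
toy <- data.frame(
  `Received Year` = c(2005, 2005, 2006, 2007, 2008),
  `Close Year`    = c(2006, 2005, 2007, 2010, 2009),
  check.names = FALSE
)

# Calendar years between receipt and closure.
years_to_close <- toy$`Close Year` - toy$`Received Year`

# Cumulative share of rows closed within 0, 1, 2, ... years.
cum_share <- cumsum(prop.table(table(years_to_close)))
round(cum_share, 2)
```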
Discovering trends and patterns in the data is the most important step in data analysis. Exploratory Data Analysis (EDA) and EDA tools help people understand the relationships, correlations, and structure of a data set. The two main techniques in EDA are graphical and numerical summaries of the data. In the 10 visualizations above, we examined the relationships between different variables and identified patterns and trends, using different types of charts such as histograms, bar charts, pie charts, line charts, and box plots. Understanding these visualizations can help find ways to improve the efficiency of the NYC agency or reduce the number of incidents. This analysis helps to understand the complaints, their frequency, and the significance of video evidence for investigations.
From the EDA above, one can conclude that the street has the highest number of complaints, which suggests it is the least safe location. The government can take more security precautions in order to reduce the number of complaints. Most complaints are filed without video evidence. The largest number of incidents occurred in Brooklyn. Abuse of authority is the most frequent allegation. A phone call is the most popular medium for filing complaints, and CCRB is the most popular place to file them. Video evidence helps to perform full investigations, so the agency can allot more resources to this medium to improve efficiency. We also found that the response time to complaints is somewhat efficient, but there is still a need to improve the efficiency of investigations. These findings help to understand the data in more depth and reveal new insights about it.