The objective of this assignment is to conduct an exploratory data analysis of a data set that you are not familiar with. In this week’s lecture, we discussed a number of visualization approaches for exploring a data set. This assignment will apply those tools and techniques. An important distinction between class examples and applied data science work is the iterative and repetitive nature of exploring a data set. It takes time to understand what the data is and what is interesting about it (patterns).
For this week, we will be exploring data from the NYC Data Transparency Initiative. They maintain a database of complaints that fall within the jurisdiction of the Civilian Complaint Review Board (CCRB), an independent municipal agency. Your objective is to identify interesting patterns and trends within the data that may be indicative of large-scale trends.
This link will allow you to download the data set in .xlsx format. The data file has two tabs: one with metadata, and the “Complaints_Allegations” tab with the actual data.
For this assignment, you should submit a link to a knitr-rendered HTML document that shows your exploratory data analysis. Organize your analysis using section headings:
rm(list = ls())
library(readxl)
## Warning: package 'readxl' was built under R version 3.6.3
eda_data = read_xlsx("C:\\Users\\Administrator\\Desktop\\ccrb_datatransparencyinitiative.xlsx", sheet = "Complaints_Allegations")
dim(eda_data)
## [1] 204397 16
str(eda_data)
## Classes 'tbl_df', 'tbl' and 'data.frame': 204397 obs. of 16 variables:
## $ DateStamp : POSIXct, format: "2016-11-29" "2016-11-29" ...
## $ UniqueComplaintId : num 11 18 18 18 18 18 18 18 18 18 ...
## $ Close Year : num 2006 2006 2006 2006 2006 ...
## $ Received Year : num 2005 2004 2004 2004 2004 ...
## $ Borough of Occurrence : chr "Manhattan" "Brooklyn" "Brooklyn" "Brooklyn" ...
## $ Is Full Investigation : logi FALSE TRUE TRUE TRUE TRUE TRUE ...
## $ Complaint Has Video Evidence : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ Complaint Filed Mode : chr "On-line website" "Phone" "Phone" "Phone" ...
## $ Complaint Filed Place : chr "CCRB" "CCRB" "CCRB" "CCRB" ...
## $ Complaint Contains Stop & Frisk Allegations: logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ Incident Location : chr "Street/highway" "Street/highway" "Street/highway" "Street/highway" ...
## $ Incident Year : num 2005 2004 2004 2004 2004 ...
## $ Encounter Outcome : chr "No Arrest or Summons" "Arrest" "Arrest" "Arrest" ...
## $ Reason For Initial Contact : chr "Other" "PD suspected C/V of violation/crime - street" "PD suspected C/V of violation/crime - street" "PD suspected C/V of violation/crime - street" ...
## $ Allegation FADO Type : chr "Abuse of Authority" "Abuse of Authority" "Discourtesy" "Discourtesy" ...
## $ Allegation Description : chr "Threat of arrest" "Refusal to obtain medical treatment" "Word" "Word" ...
names(eda_data)
## [1] "DateStamp"
## [2] "UniqueComplaintId"
## [3] "Close Year"
## [4] "Received Year"
## [5] "Borough of Occurrence"
## [6] "Is Full Investigation"
## [7] "Complaint Has Video Evidence"
## [8] "Complaint Filed Mode"
## [9] "Complaint Filed Place"
## [10] "Complaint Contains Stop & Frisk Allegations"
## [11] "Incident Location"
## [12] "Incident Year"
## [13] "Encounter Outcome"
## [14] "Reason For Initial Contact"
## [15] "Allegation FADO Type"
## [16] "Allegation Description"
summary(eda_data)
## DateStamp UniqueComplaintId Close Year Received Year
## Min. :2016-11-29 Min. : 1 Min. :2006 Min. :1999
## 1st Qu.:2016-11-29 1st Qu.:17356 1st Qu.:2008 1st Qu.:2007
## Median :2016-11-29 Median :34794 Median :2010 Median :2009
## Mean :2016-11-29 Mean :34778 Mean :2010 Mean :2010
## 3rd Qu.:2016-11-29 3rd Qu.:52204 3rd Qu.:2013 3rd Qu.:2012
## Max. :2016-11-29 Max. :69492 Max. :2016 Max. :2016
## Borough of Occurrence Is Full Investigation Complaint Has Video Evidence
## Length:204397 Mode :logical Mode :logical
## Class :character FALSE:107084 FALSE:195530
## Mode :character TRUE :97313 TRUE :8867
##
##
##
## Complaint Filed Mode Complaint Filed Place
## Length:204397 Length:204397
## Class :character Class :character
## Mode :character Mode :character
##
##
##
## Complaint Contains Stop & Frisk Allegations Incident Location Incident Year
## Mode :logical Length:204397 Min. :1999
## FALSE:119856 Class :character 1st Qu.:2007
## TRUE :84541 Mode :character Median :2009
## Mean :2010
## 3rd Qu.:2012
## Max. :2016
## Encounter Outcome Reason For Initial Contact Allegation FADO Type
## Length:204397 Length:204397 Length:204397
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## Allegation Description
## Length:204397
## Class :character
## Mode :character
##
##
##
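In the summary above, the character columns report only Length, Class, and Mode, which says nothing about their values. One possible way to get level counts is to convert character columns to factors before calling summary(). The sketch below uses a small made-up data frame (`toy`, with hypothetical values) as a stand-in for eda_data, so it runs without the CCRB file.

```r
# Toy stand-in for eda_data (hypothetical values, not the CCRB file)
toy <- data.frame(
  `Borough of Occurrence` = c("Brooklyn", "Bronx", "Brooklyn", "Queens"),
  `Incident Year` = c(2005, 2006, 2006, 2007),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Find the character columns and convert each one to a factor
is_chr <- vapply(toy, is.character, logical(1))
toy[is_chr] <- lapply(toy[is_chr], factor)

# summary() now reports counts per level for the converted columns
summary(toy)

# Counts for a single factor column
borough_counts <- summary(toy$`Borough of Occurrence`)
borough_counts
```

The same two conversion lines should apply to eda_data unchanged, assuming its text columns are read in as character, as the str() output above suggests.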
Viz:1 - The following bar chart shows the distribution of incidents across boroughs. From the graph we can see that most incidents occurred in Brooklyn and the fewest occurred outside NYC.
borough = table(eda_data$`Borough of Occurrence`)
lbls = names(borough)
barplot(borough,
xlab = "Borough of Occurrence",
ylab = "Number",
main = "Borough of Occurrence in CCRB Report",
horiz = FALSE,
legend.text = TRUE,
cex.axis = 1.0,
cex.names = 1.0,
col=rainbow(length(lbls)))
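A claim such as "most incidents in Brooklyn, fewest outside NYC" can also be checked numerically by sorting the frequency table rather than reading it off the chart. This is a minimal sketch on a made-up vector (`borough_toy` is hypothetical); with the real data the input would be the Borough of Occurrence column.

```r
# Hypothetical borough values standing in for eda_data$`Borough of Occurrence`
borough_toy <- c("Brooklyn", "Bronx", "Brooklyn", "Outside NYC",
                 "Brooklyn", "Bronx")

# Frequency table sorted from most to least common
counts <- sort(table(borough_toy), decreasing = TRUE)
counts

# Most and least frequent categories
top_borough <- names(counts)[1]
bottom_borough <- names(counts)[length(counts)]
```

Passing the sorted table to barplot() would also order the bars by height, which tends to make the ranking easier to read.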
Viz:2 - The following pie chart shows the percentage distribution of the modes in which complaints were filed. It indicates that most complaints were filed by phone, followed by the Call Processing System.
complaint_mode = table(eda_data$`Complaint Filed Mode`)
lbls <- names(complaint_mode)
lbls
## [1] "Call Processing System" "E-mail" "Fax"
## [4] "In-person" "Mail" "On-line website"
## [7] "Phone"
pct <- round(complaint_mode/sum(complaint_mode)*100)
lbls <- paste(lbls, pct) # add percents to labels
lbls <- paste(lbls,"%",sep="") # add % to labels
pie(complaint_mode,labels = lbls, col=rainbow(length(lbls)),main="Complaints Filed Mode in CCRB Report")
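The labels above round each share to a whole percent, so very small categories can show up as 0%. A minimal sketch of an alternative, using prop.table() and one decimal place, is shown below on a made-up vector (`mode_toy` is hypothetical, not the CCRB data).

```r
# Hypothetical filing modes standing in for eda_data$`Complaint Filed Mode`
mode_toy <- c(rep("Phone", 6), rep("Call Processing System", 3), "Fax")

# Share of each mode; prop.table() divides each count by the total
shares <- prop.table(table(mode_toy))

# Percentages rounded to one decimal place
pct <- round(100 * shares, 1)

# Build labels in a single step with paste0()
pie_labels <- paste0(names(pct), " ", pct, "%")
pie_labels
```

These labels can be passed to pie() through its labels argument in the same way as lbls above.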
Viz:3 - The following scatter plot shows the relationship between the year a complaint was received and the year it was closed, along with a regression line.
library(ggplot2)
cleanup = theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
axis.line.x = element_line(color = 'black'),
axis.line.y = element_line(color = 'black'),
legend.key = element_rect(fill = 'white'),
text = element_text(size = 15))
scatter = ggplot(eda_data, aes(`Received Year`, `Close Year`))
scatter + geom_point() +
geom_smooth(method = 'lm', color = 'blue') +
xlab('Complaints Received Year') +
ylab('Complaints Closed Year') +
ggtitle('Complaints Receiving and Closing Year in CCRB Report') +
cleanup
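Because both axes are whole years, many of the 204,397 rows land on the same point, so the scatter plot cannot show how many complaints sit behind each dot. One way to recover that information is to tabulate the gap between the two years and the count per (received, closed) pair. The sketch below uses a made-up data frame; the column names `received`, `closed`, and `years_open` are hypothetical stand-ins for the Received Year and Close Year columns.

```r
# Hypothetical received/closed years standing in for the two year columns
toy <- data.frame(
  received = c(2005, 2005, 2006, 2007, 2007),
  closed   = c(2006, 2006, 2006, 2008, 2009)
)

# Number of calendar years between receipt and closure
toy$years_open <- toy$closed - toy$received
lag_counts <- table(toy$years_open)
lag_counts

# Count of rows per (received, closed) pair, keeping only pairs that occur
pair_counts <- as.data.frame(table(received = toy$received,
                                   closed = toy$closed))
pair_counts <- pair_counts[pair_counts$Freq > 0, ]
pair_counts
```

The pair counts could then be mapped to point size or tile fill so that the plot reflects volume as well as position.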
Viz:4 - The graph below is a stacked bar chart showing the distribution of complaints across boroughs, broken down by complaint filing mode. We can see that most complaints came from Brooklyn and that most were filed by phone, followed by the Call Processing System.
library(ggthemes)
## Warning: package 'ggthemes' was built under R version 3.6.3
stack_plot = ggplot(eda_data, aes(`Borough of Occurrence`, fill = `Complaint Filed Mode`))
stack_plot + geom_bar() +
scale_fill_discrete(name = "Complaint Filed Mode") +
xlab('Borough of Occurrence') +
theme_solarized()
Viz:5 - From the graph below, we can see that the histogram is left-skewed, with most of the data concentrated at the higher end. The frequency distribution indicates that most incidents occurred between 2005 and 2010, with very few incidents around 2000 and a steady decrease from 2010 to 2015.
incident_hist = ggplot(eda_data, aes(`Incident Year`))
incident_hist + geom_histogram(binwidth = 1.0, color = "green") + xlab("Incident Year") +
ylab("Frequency") + ggtitle('Histogram of Incident Year in CCRB Report')
Viz:6 - The visualization below shows the percentage distribution of encounter outcomes and a density plot of encounter outcome by borough of occurrence.
library(grid)
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 3.6.3
library(Rmisc)
## Warning: package 'Rmisc' was built under R version 3.6.3
## Loading required package: lattice
## Loading required package: plyr
encounter = table(eda_data$`Encounter Outcome`)
lbls <- names(encounter)
lbls
## [1] "Arrest" "No Arrest or Summons" "Other/NA"
## [4] "Summons"
pct <- round(encounter/sum(encounter)*100)
lbls <- paste(lbls, pct) # add percents to labels
lbls <- paste(lbls,"%",sep="") # add % to labels
pie(encounter,labels = lbls, col=rainbow(length(lbls)),main="Encounter Outcome in CCRB Report")
density_plot = ggplot(eda_data, aes(x = `Encounter Outcome`, fill = `Borough of Occurrence`, color = `Borough of Occurrence`))
density_plot + geom_density(alpha=0.4) +
xlab("Encounter Outcome") +
ggtitle("Encounter Outcome per Borough in CCRB Report") +
theme_economist()
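Encounter Outcome is a categorical variable, so a cross-tabulation is one alternative way to compare outcomes across boroughs. The sketch below computes, for each borough, the share of each outcome using prop.table() with margin = 1. The data frame `toy` and its column names are hypothetical stand-ins for the two columns used above.

```r
# Hypothetical borough/outcome pairs standing in for the CCRB columns
toy <- data.frame(
  borough = c("Brooklyn", "Brooklyn", "Brooklyn", "Bronx", "Bronx"),
  outcome = c("Arrest", "Arrest", "Summons", "Arrest",
              "No Arrest or Summons"),
  stringsAsFactors = FALSE
)

# Counts of each outcome within each borough
xt <- table(toy$borough, toy$outcome)

# Row proportions: each borough's row sums to 1
row_share <- prop.table(xt, margin = 1)

# Percentages, rounded for display
round(100 * row_share, 1)
```

Row proportions make boroughs with very different complaint volumes directly comparable, which raw counts do not.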
Viz:7 - The bar chart below shows the distribution of Allegation FADO Type. From the graph it is clear that Abuse of Authority occurs more often than any other type.
allegation = table(eda_data$`Allegation FADO Type`)
lbls = names(allegation)
barplot(allegation,
xlab = "Allegation FADO Type",
ylab = "Number",
main = "Allegation FADO Type in CCRB Report",
horiz = FALSE,
legend.text = TRUE,
cex.axis = 1.0,
cex.names = 1.0,
col=rainbow(length(lbls)))
Viz:8 - The box plot below shows the relationship between Incident Year and Complaint Filed Place. Outliers can also be identified from this graph.
box_plot = ggplot(eda_data, aes(x = `Complaint Filed Place`, y = `Incident Year`)) + geom_boxplot()
box_plot + xlab("Complaint Filed Place") +
ylab("Incident Year") +
coord_flip() +
theme_wsj()
Viz:9 - The graphs below show the distribution of complaints by the year they were received and by the corresponding year they were closed.
library(Rmisc)
received_hist = ggplot(eda_data, aes(`Received Year`))
plot1 = received_hist + geom_histogram(binwidth = 1.0, color = "blue") + xlab("Received Year") +
ylab("Frequency") + ggtitle('Histogram of Received Year') + theme_economist()
closed_hist = ggplot(eda_data, aes(`Close Year`))
plot2 = closed_hist + geom_histogram(binwidth = 1.0, color = "blue") + xlab("Closed Year") +
ylab("Frequency") + ggtitle('Histogram of Closed Year') + theme_economist()
multiplot(plot1, plot2, cols = 2)
Viz:10 - The graph below shows the relationship between Incident Year and Incident Location, along with outliers.
box_plot = ggplot(eda_data, aes(x = `Incident Location`, y = `Incident Year`)) + geom_boxplot(notch = FALSE, aes(fill = `Incident Location`))
box_plot + xlab("Incident Location") +
ylab("Incident Year") +
coord_flip()
We were given data from the NYC Data Transparency Initiative, which maintains a database of complaints that fall within the jurisdiction of the Civilian Complaint Review Board (CCRB). The “Complaints_Allegations” sheet of this Excel file contains data on all CCRB jurisdiction complaints closed in or after 2006. Looking at the data, we see that it has 204,397 observations of 16 variables, which describe how, when, and what complaints were filed, but we cannot draw any statistical inference just by looking at the raw data. In this situation, exploratory data analysis comes in handy. We produced multiple visualizations to understand the data, and from these visualizations we can make inferences such as:
1. Most of the incidents occurred in Brooklyn (Borough of Occurrence).
2. Phone is the most common mode of filing complaints.
3. Incident occurrence peaks between 2005 and 2010.
4. Arrest is the most common encounter outcome.
With exploratory data analysis, inferences can therefore be backed up by summary statistics and visual evidence. It helps establish a foundation for further, more complex analysis.