The objective of this assignment is to conduct an exploratory data analysis of a data set that you are not familiar with. In this week's lecture we discussed a number of visualization approaches to exploring a data set; this assignment will apply those tools and techniques. An important distinction between class examples and applied data science work is the iterative and repetitive nature of exploring a data set. It takes time to understand what is in the data and what is interesting in it.
For this week we will be exploring data from the NYC Data Transparency Initiative. They maintain a database of complaints that fall under the Civilian Complaint Review Board (CCRB), an independent municipal agency. Your objective is to identify interesting patterns and trends within the data that may be indicative of large-scale trends.
This link will allow you to download the data set in .xlsx format. The data file has two tabs: one with metadata, and the “Complaints_Allegations” tab with the actual data.
For this assignment you should submit a link to a knitr rendered html document that shows your exploratory data analysis. Organize your analysis using section headings:
Your final document should include at least 10 visualizations. Each should include a brief statement of why you made the graphic.
A final section should summarize what you learned from your EDA. Your grade will be based on the quality of your graphics and the sophistication of your findings.
setwd("~/Documents")
library(readxl)
library(ggthemes)
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(forcats)
ccrb = read_excel("~/Documents/ccrb_datatransparencyinitiative.xlsx")
summary(ccrb)
## DateStamp UniqueComplaintId Close Year Received Year
## Min. :2016-11-29 Min. : 1 Min. :2006 Min. :1999
## 1st Qu.:2016-11-29 1st Qu.:17356 1st Qu.:2008 1st Qu.:2007
## Median :2016-11-29 Median :34794 Median :2010 Median :2009
## Mean :2016-11-29 Mean :34778 Mean :2010 Mean :2010
## 3rd Qu.:2016-11-29 3rd Qu.:52204 3rd Qu.:2013 3rd Qu.:2012
## Max. :2016-11-29 Max. :69492 Max. :2016 Max. :2016
## Borough of Occurrence Is Full Investigation Complaint Has Video Evidence
## Length:204397 Mode :logical Mode :logical
## Class :character FALSE:107084 FALSE:195530
## Mode :character TRUE :97313 TRUE :8867
##
##
##
## Complaint Filed Mode Complaint Filed Place
## Length:204397 Length:204397
## Class :character Class :character
## Mode :character Mode :character
##
##
##
## Complaint Contains Stop & Frisk Allegations Incident Location
## Mode :logical Length:204397
## FALSE:119856 Class :character
## TRUE :84541 Mode :character
##
##
##
## Incident Year Encounter Outcome Reason For Initial Contact
## Min. :1999 Length:204397 Length:204397
## 1st Qu.:2007 Class :character Class :character
## Median :2009 Mode :character Mode :character
## Mean :2010
## 3rd Qu.:2012
## Max. :2016
## Allegation FADO Type Allegation Description
## Length:204397 Length:204397
## Class :character Class :character
## Mode :character Mode :character
##
##
##
This bar chart shows the distribution of complaints across filing modes and, within each mode, the breakdown by allegation type.
Phone Call is by far the most common filing mode, with significantly more complaints than any other. Mail, email and fax are the three least used modes and are very small by comparison.
Across modes, Abuse of Authority and Force are consistently the two allegation types with the most complaints.
# Refer to columns by bare (backticked) name inside aes(), not ccrb$...
ggplot(ccrb, aes(x = fct_infreq(`Complaint Filed Mode`), fill = `Allegation FADO Type`)) +
  geom_bar() +
  labs(title = "Number of Complaints Received by Filing Mode", x = "Complaint Filed Mode", y = "Number of Complaints") +
  theme(legend.position = "bottom") +
  scale_fill_discrete(name = "Allegation Type")
Phone is the single most frequently used filing mode, so a subset of the data is built for complaints filed by phone.
Visualizations 2 through 4 use this subset.
This pie chart shows how the complaints filed by phone are distributed across boroughs.
Brooklyn has the most complaints filed, followed by the Bronx; together they account for more than half of all phone-filed complaints. Staten Island has the fewest.
Of course, the population density of each borough, which is not in the data, also affects the number of complaints.
# Keep only the rows filed by phone
phone = subset(ccrb, `Complaint Filed Mode` == "Phone")
# A single stacked bar drawn in polar coordinates gives a pie chart
ggplot(phone, aes(x = factor(1), fill = `Borough of Occurrence`)) +
  geom_bar() +
  labs(title = "Number of Complaints Filed by Phone Across Boroughs") +
  theme(legend.position = "bottom") +
  scale_fill_discrete(name = "Boroughs") +
  coord_polar(theta = "y")
This line chart shows the annual trend of complaints filed by phone. The horizontal axis is the year in which the complaints were received and the vertical axis is the number of complaints filed in that year.
The trend line shows that before 2005 the count was almost zero, with only one data point, and then jumped suddenly in 2006. This most likely reflects limited data availability rather than the real number of complaints received. After 2006, the number of complaints received per year dropped steadily at a roughly constant rate.
library(plyr)
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
##
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
library(ROCR)
## Loading required package: gplots
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
# Count distinct complaint IDs per year; FUN = sum would add up the ID values themselves
phone.trend = aggregate(phone$UniqueComplaintId, by = list(Year = phone$`Received Year`), FUN = function(id) length(unique(id)))
ggplot(data = phone.trend, aes(x = Year, y = x)) +
  geom_line(alpha = 0.5) +
  ggtitle("Annual Trend of Complaints Filed by Phone") +
  xlab("Received Year") +
  ylab("Number of Complaints Filed by Phone") +
  theme_economist()
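The yearly count can also be sketched with dplyr. This is only an illustration on a made-up stand-in for the phone subset (`phone_toy` and its values are hypothetical). It assumes each row of the data is one allegation, which the row count (204,397) against the largest complaint ID (69,492) suggests, so distinct IDs are counted rather than rows. The explicit `dplyr::` prefixes are a precaution because plyr is attached after dplyr here and masks `summarise()`.

```r
library(dplyr)

# Hypothetical toy data standing in for the `phone` subset (values are made up)
phone_toy <- data.frame(
  UniqueComplaintId = c(1, 1, 2, 3, 3, 3, 4),
  `Received Year`   = c(2006, 2006, 2006, 2007, 2007, 2007, 2008),
  check.names = FALSE
)

# One row per year, with the number of distinct complaint IDs received that year.
# The dplyr:: prefix keeps these calls from resolving to plyr's masked versions.
phone_trend <- phone_toy %>%
  dplyr::group_by(`Received Year`) %>%
  dplyr::summarise(complaints = dplyr::n_distinct(UniqueComplaintId)) %>%
  dplyr::ungroup()

print(phone_trend)
```

The resulting `complaints` column could then be mapped to the y axis of the line chart.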
This scatter plot shows the year in which a complaint was closed against the year in which it was received, for complaints filed by phone.
The points follow a steady linear trend, suggesting that the processing time of complaints has been stable over the years.
ggplot(phone, aes(x = `Received Year`, y = `Close Year`)) +
  geom_point(shape = 14, color = "pink") +
  geom_smooth(method = lm, se = FALSE, color = "green") +
  labs(title = "Relationship between Closed Year and Received Year", x = "Received Year", y = "Closed Year")
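One way to check the claim about processing time more directly is to compute the gap between the two year columns and average it per received year. The sketch below uses made-up toy values (`phone_toy` and `years_open` are hypothetical names), and because the data only records years, the gap is accurate to the year at best.

```r
# Hypothetical toy data standing in for the `phone` subset (values are made up)
phone_toy <- data.frame(
  `Received Year` = c(2006, 2006, 2007, 2007, 2008),
  `Close Year`    = c(2006, 2007, 2007, 2008, 2009),
  check.names = FALSE
)

# Years between receiving and closing each record
phone_toy$years_open <- phone_toy$`Close Year` - phone_toy$`Received Year`

# Average gap for each received year; a flat series would support a stable processing time
lag_by_year <- tapply(phone_toy$years_open, phone_toy$`Received Year`, mean)

print(lag_by_year)
```

Records received in the most recent years can only appear once they are closed, so their average gap is likely to be biased downward.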
This bar chart shows how many complaints were fully investigated and how many were not, for each year of the period.
For complaints received in 2005, more than half were fully investigated, but the pattern changed later on. For more recent years, after 2008, fewer than half of the complaints received were fully investigated.
# geom_bar() counts rows by default; geom_histogram(stat = "count") only triggers an "unknown parameters" warning
ggplot(ccrb, aes(x = `Received Year`, fill = `Is Full Investigation`)) +
  geom_bar() +
  labs(title = "Full Investigation by Year", x = "Year", y = "Number of Full Investigation vs. Not") +
  theme(legend.position = "bottom") +
  scale_fill_discrete(name = "Full Investigation")
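Statements such as "more than half were fully investigated" are easier to verify from proportions than from stacked counts. A minimal sketch, using made-up toy values (`ccrb_toy` is a hypothetical stand-in for the real data frame), takes the mean of the logical column per year, which is the share of TRUE values:

```r
# Hypothetical toy data standing in for `ccrb` (values are made up)
ccrb_toy <- data.frame(
  `Received Year`         = c(2005, 2005, 2005, 2009, 2009, 2009, 2009),
  `Is Full Investigation` = c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE),
  check.names = FALSE
)

# The mean of a logical vector is the proportion of TRUE values
share_full <- tapply(ccrb_toy$`Is Full Investigation`, ccrb_toy$`Received Year`, mean)

print(round(share_full, 2))
```

In ggplot2, the same comparison can be drawn by passing `position = "fill"` to `geom_bar()`, which rescales every bar to a height of 1.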
This bar chart shows, for each year of the period, the number of complaints that had video evidence and the number that did not.
Until 2010, almost no complaints had video evidence. After that, more of the complaints received had video evidence. This is logical, as the technology became more readily available.
# geom_bar() counts rows by default; geom_histogram(stat = "count") only triggers an "unknown parameters" warning
ggplot(ccrb, aes(x = `Received Year`, fill = `Complaint Has Video Evidence`)) +
  geom_bar() +
  labs(title = "Video Evidence by Year", x = "Year", y = "Number of Video Evidence vs. Not") +
  theme(legend.position = "bottom") +
  scale_fill_discrete(name = "Video Evidence")
This horizontal bar chart shows the distribution of complaints across boroughs and, within each borough, the distribution across incident locations.
Brooklyn and the Bronx are the two boroughs with the most complaints filed, matching the finding for complaints filed by phone.
Across boroughs, street/highway is the location with the most complaints filed, followed by subway station/train.
# x is the borough and fill is the incident location; labs() names the aesthetics, so the labels follow the bars after coord_flip()
ggplot(ccrb, aes(x = `Borough of Occurrence`, fill = `Incident Location`)) +
  geom_bar() +
  labs(title = "Incident Location by Borough", x = "Borough", y = "Number of Incidents") +
  coord_flip() +
  theme(legend.position = "bottom") +
  scale_fill_discrete(name = "Incident Location")
The subset for the incident location Street/highway is therefore taken for a closer look.
This rank chart shows the ten most frequent Reasons For Initial Contact, ranked by number of complaints, for all incidents that happened on a street/highway.
The chart shows that "PD suspected C/V of violation/crime - street" is significantly more frequent than all other reasons.
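# Keep only incidents that happened on a street or highway
street = subset(ccrb, `Incident Location` == "Street/highway")
# Frequency table of contact reasons, most frequent first (columns Var1 and Freq)
street_rank = data.frame(sort(table(street$`Reason For Initial Contact`), decreasing = TRUE))
# Plot the ten most frequent reasons
ggplot(street_rank[1:10, ], aes(Var1, Freq)) +
  geom_point() +
  coord_flip()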
This box plot shows the distribution of incident years for each encounter outcome.
The median incident year is 2009 for every outcome, with Arrest showing a wider spread.
# The title belongs in labs(), not in aes()
ggplot(ccrb, aes(x = `Encounter Outcome`, y = `Incident Year`)) +
  geom_boxplot(fill = "pink", color = "green3") +
  labs(title = "Encounter Outcomes by Year") +
  scale_x_discrete(name = "Encounter Outcome") +
  scale_y_continuous(name = "Incident Year")
## Visualization 10
After seeing how the outcomes are distributed across years, we’d like to see the breakdown of all complaints by outcome.
The pie chart shows that Arrest and No Arrest/Summons account for most complaints, leaving very few with the other two outcomes.
ggplot(ccrb, aes(x = factor(1), fill = `Encounter Outcome`)) +
  geom_bar() +
  labs(title = "Number of Complaints Filed by Encounter Outcomes") +
  theme(legend.position = "bottom") +
  scale_fill_discrete(name = "Encounter Outcome") +
  coord_polar(theta = "y")