Objectives

The objective of this assignment is to conduct an exploratory data analysis of a data set that you are not familiar with. In this week's lecture we discussed a number of visualization approaches to exploring a data set; this assignment will apply those tools and techniques. An important distinction between class examples and applied data science work is the iterative and repetitive nature of exploring a data set. It takes time to understand what is in the data and what is interesting in it.

For this week we will be exploring data from the NYC Data Transparency Initiative. They maintain a database of complaints that fall under the Civilian Complaint Review Board (CCRB), an independent municipal agency. Your objective is to identify interesting patterns and trends within the data that may be indicative of large-scale trends.

This link will allow you to download the data set in .xlsx format. The data file has two tabs: one with metadata, and the “Complaints_Allegations” tab with the actual data.

Deliverable and Grades

For this assignment you should submit a link to a knitr rendered html document that shows your exploratory data analysis. Organize your analysis using section headings:

Your final document should include at minimum 10 visualizations. Each should include a brief statement of why you made the graphic.

A final section should summarize what you learned from your EDA. Your grade will be based on the quality of your graphics and the sophistication of your findings.

setwd ("~/Documents")

library(readxl)
library(ggthemes)
library(ggplot2)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(forcats)

ccrb = read_excel("~/Documents/ccrb_datatransparencyinitiative.xlsx")

summary(ccrb)

##    DateStamp          UniqueComplaintId   Close Year   Received Year 
##  Min.   :2016-11-29   Min.   :    1     Min.   :2006   Min.   :1999  
##  1st Qu.:2016-11-29   1st Qu.:17356     1st Qu.:2008   1st Qu.:2007  
##  Median :2016-11-29   Median :34794     Median :2010   Median :2009  
##  Mean   :2016-11-29   Mean   :34778     Mean   :2010   Mean   :2010  
##  3rd Qu.:2016-11-29   3rd Qu.:52204     3rd Qu.:2013   3rd Qu.:2012  
##  Max.   :2016-11-29   Max.   :69492     Max.   :2016   Max.   :2016  
##  Borough of Occurrence Is Full Investigation Complaint Has Video Evidence
##  Length:204397         Mode :logical         Mode :logical               
##  Class :character      FALSE:107084          FALSE:195530                
##  Mode  :character      TRUE :97313           TRUE :8867                  
##                                                                          
##                                                                          
##                                                                          
##  Complaint Filed Mode Complaint Filed Place
##  Length:204397        Length:204397        
##  Class :character     Class :character     
##  Mode  :character     Mode  :character     
##                                            
##                                            
##                                            
##  Complaint Contains Stop & Frisk Allegations Incident Location 
##  Mode :logical                               Length:204397     
##  FALSE:119856                                Class :character  
##  TRUE :84541                                 Mode  :character  
##                                                                
##                                                                
##                                                                
##  Incident Year  Encounter Outcome  Reason For Initial Contact
##  Min.   :1999   Length:204397      Length:204397             
##  1st Qu.:2007   Class :character   Class :character          
##  Median :2009   Mode  :character   Mode  :character          
##  Mean   :2010                                                
##  3rd Qu.:2012                                                
##  Max.   :2016                                                
##  Allegation FADO Type Allegation Description
##  Length:204397        Length:204397         
##  Class :character     Class :character      
##  Mode  :character     Mode  :character      
##                                             
##                                             
##
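
One thing the summary suggests: the table has 204,397 rows, but UniqueComplaintId only goes up to 69,492, so each row is probably one allegation rather than one complaint. The sketch below uses a made-up toy data frame (not the CCRB data; the column names are simplified assumptions) to show how allegation-level rows might be collapsed to one row per complaint before counting complaints.

```r
# Toy data (hypothetical values): three complaints spread over five allegation rows.
toy <- data.frame(
  UniqueComplaintId = c(1, 1, 2, 3, 3),
  ReceivedYear      = c(2006, 2006, 2006, 2007, 2007),
  FadoType          = c("Force", "Discourtesy", "Force",
                        "Abuse of Authority", "Force"),
  stringsAsFactors  = FALSE
)

# Number of allegation rows versus number of distinct complaints.
n_allegations <- nrow(toy)
n_complaints  <- length(unique(toy$UniqueComplaintId))

# Keep the first row of each complaint to get a complaint-level table.
complaints <- toy[!duplicated(toy$UniqueComplaintId), ]

print(c(allegations = n_allegations, complaints = n_complaints))
```

Whether a chart should count allegations or complaints depends on the question being asked; the point is only that the two counts can differ.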

Visualization 1

This bar chart shows the distribution of complaints by filing mode and, within each mode, the breakdown by allegation type.

Phone Call is by far the most common filing mode. Mail, email, and fax are the three least used, and each is very small compared with the others.

Across modes, Abuse of Authority and Force are consistently the two allegation types with the most complaints.

# Filing modes ordered by frequency; bars are stacked by allegation type
ggplot(ccrb, aes(x = fct_infreq(`Complaint Filed Mode`), fill = `Allegation FADO Type`)) + geom_bar() + labs(title = "Number of Complaints Received by Filing Mode", x = "Complaint Filed Mode", y = "Number of Complaints") + theme(legend.position = "bottom") + scale_fill_discrete(name = "Allegation Type")

Visualization 2

Because phone is the single most frequently used filing mode, a subset is built for complaints filed by phone.

Visualizations 2 through 4 use this subset.

This pie chart shows the distribution across boroughs of complaints filed by phone.

Brooklyn has the most complaints, followed by the Bronx. Together the two account for more than half of all phone-filed complaints. Staten Island has the fewest.

Borough population also affects the number of complaints, but population figures are not in the data, so these counts are not per-capita rates.

phone=subset(ccrb,ccrb$`Complaint Filed Mode`=="Phone")

# A single stacked bar wrapped into polar coordinates gives a pie chart
ggplot(phone, aes(x = factor(1), fill = `Borough of Occurrence`)) + geom_bar() + labs(title = "Number of Complaints Filed by Phone Across Boroughs") + theme(legend.position = "bottom") + scale_fill_discrete(name = "Boroughs") + coord_polar(theta = "y")

Visualization 3

This line chart shows the annual trend of complaints filed by phone. The horizontal axis is the year in which the complaints were received, and the vertical axis is the number of complaints received in that year.

Before 2005 the count is almost zero, with only one data point, and it jumps in 2006. This most likely reflects data availability rather than the real number of complaints received: in the summary above, the earliest Close Year is 2006, so complaints received and closed before then are not in the data set. After 2006, the number of complaints received per year declines at a roughly constant rate.

library(plyr)

## -------------------------------------------------------------------------

## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)

## -------------------------------------------------------------------------

## 
## Attaching package: 'plyr'

## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize

library(ROCR)

## Loading required package: gplots

## 
## Attaching package: 'gplots'

## The following object is masked from 'package:stats':
## 
##     lowess

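# Count distinct complaint IDs per year (summing the IDs would not give a count)
phone.trend = aggregate(phone$UniqueComplaintId, by = list(Year = phone$`Received Year`), FUN = function(x) length(unique(x)))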

ggplot(data = phone.trend, aes(x = Year, y = x)) + geom_line(alpha = 0.5) + ggtitle("Annual Trend of Complaints Filed by Phone") + xlab("Received Year") + ylab("Number of Complaints Filed by Phone") + theme_economist()
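
When building a per-year trend from an ID column, it matters whether the IDs are summed or counted. The sketch below compares the two on a small made-up data frame (hypothetical values and simplified column names, not the CCRB data); only the second aggregation is a count of complaints.

```r
# Toy data (hypothetical): two complaints in 2006, one complaint in 2007,
# each repeated across several allegation rows.
toy <- data.frame(
  UniqueComplaintId = c(101, 101, 102, 203, 203, 203),
  ReceivedYear      = c(2006, 2006, 2006, 2007, 2007, 2007)
)

# Summing the IDs: the result depends on the size of the ID values.
id_sum <- aggregate(UniqueComplaintId ~ ReceivedYear, data = toy, FUN = sum)

# Counting distinct IDs: the number of complaints received in each year.
id_count <- aggregate(UniqueComplaintId ~ ReceivedYear, data = toy,
                      FUN = function(x) length(unique(x)))

print(id_sum)
print(id_count)
```

In this toy case the summed values are larger for 2007 even though 2007 has fewer complaints, which is why a summed-ID trend line can be misleading.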

Visualization 4

This scatter plot compares Close Year, the year in which a complaint was closed, with Received Year, the year in which it was received, for complaints filed by phone.

The points follow a steady linear trend, suggesting that the processing time of complaints has been stable over the years.

ggplot(phone, aes(x = `Received Year`, y = `Close Year`)) + geom_point(shape = 14, color = "pink") + geom_smooth(method = "lm", se = FALSE, color = "green") + labs(title = "Relationship between Closed Year and Received Year", x = "Received Year", y = "Closed Year")
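
Processing time can also be checked more directly as the difference between the close year and the received year, averaged per received year. This is a minimal sketch on made-up values (hypothetical data and simplified column names), not a result from the CCRB file.

```r
# Toy data (hypothetical): received and close years for five complaints.
toy <- data.frame(
  ReceivedYear = c(2006, 2006, 2007, 2007, 2008),
  CloseYear    = c(2006, 2007, 2007, 2008, 2009)
)

# Years between receipt and closure for each complaint.
toy$YearsToClose <- toy$CloseYear - toy$ReceivedYear

# Average lag per received year; a flat series would support a "stable" reading.
mean_lag <- tapply(toy$YearsToClose, toy$ReceivedYear, mean)

print(mean_lag)
```

Because the data only record years, this lag is coarse: a complaint received in December and closed in January still counts as one year.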

Visualization 5

This bar chart shows, for each year in the data, how many complaints were fully investigated and how many were not.

For complaints received in 2005, more than half were fully investigated, but the pattern changed later. For complaints received after 2008, fewer than half were fully investigated.

# geom_bar() counts rows per year, so no histogram binning parameters are involved
ggplot(ccrb, aes(x = `Received Year`, fill = `Is Full Investigation`)) + geom_bar() + labs(title = "Full Investigation by Year", x = "Year", y = "Number of Complaints") + theme(legend.position = "bottom") + scale_fill_discrete(name = "Full Investigation")
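
Statements such as "more than half" or "fewer than half" are easier to check as yearly shares than as stacked counts. The sketch below computes the share of fully investigated rows per year on made-up values (hypothetical data and simplified column names).

```r
# Toy data (hypothetical): investigation flag for seven rows in two years.
toy <- data.frame(
  ReceivedYear        = c(2005, 2005, 2005, 2009, 2009, 2009, 2009),
  IsFullInvestigation = c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE)
)

# The mean of a logical vector is the proportion of TRUE values.
share_full <- tapply(toy$IsFullInvestigation, toy$ReceivedYear, mean)

print(round(share_full, 2))
```

In ggplot2, `geom_bar(position = "fill")` draws the same shares directly by scaling each bar to a height of 1.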

Visualization 6

This bar chart shows, for each year, the number of complaints that had video evidence and the number that did not.

Until 2010, almost no complaints had video evidence. After that, a growing number of complaints did. This is plausible, since recording technology became more readily available.

# geom_bar() counts rows per year, so no histogram binning parameters are involved
ggplot(ccrb, aes(x = `Received Year`, fill = `Complaint Has Video Evidence`)) + geom_bar() + labs(title = "Video Evidence by Year", x = "Year", y = "Number of Complaints") + theme(legend.position = "bottom") + scale_fill_discrete(name = "Video Evidence")

Visualization 7

This horizontal bar chart shows the distribution of complaints across boroughs and, within each borough, the distribution across incident locations.

Brooklyn and the Bronx are the two boroughs with the most complaints, consistent with the findings for complaints filed by phone.

Across boroughs, street/highway is the location with the most complaints, followed by subway station/train.

# x is the borough (drawn vertically after coord_flip); fill is the incident location
ggplot(ccrb, aes(x = `Borough of Occurrence`, fill = `Incident Location`)) + geom_bar() + labs(title = "Incident Location by Borough", x = "Borough", y = "Number of Complaints") + coord_flip() + theme(legend.position = "bottom") + scale_fill_discrete(name = "Incident Location")

Visualization 8

Because street/highway is the most common incident location, a subset for that location is taken for a closer look.

This rank chart shows the ten most frequent Reasons For Initial Contact, ranked by number of complaints, for incidents that happened on a street or highway.

The chart shows that "PD suspected C/V of violation/crime - street" is far more frequent than any other reason.

street=subset(ccrb,ccrb$`Incident Location`=="Street/highway")

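# Tabulate reasons for initial contact, most frequent first
street_rank = data.frame(sort(table(street$`Reason For Initial Contact`), decreasing = TRUE))
# Plot the ten most frequent reasons
ggplot(street_rank[1:10, ], aes(x = Var1, y = Freq)) + geom_point() + coord_flip() + labs(x = "Reason For Initial Contact", y = "Number of Complaints")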
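
The ranking step relies on tabulating a categorical column and sorting the counts. A minimal sketch of that pattern on made-up labels (the values "A" to "D" are placeholders, not actual reasons from the data):

```r
# Toy data (hypothetical): one label per row.
reasons <- c("A", "B", "A", "C", "A", "B", "D")

# Count each label and sort from most to least frequent.
counts <- sort(table(reasons), decreasing = TRUE)

# Keep the top two and store them as a plain data frame.
top2 <- counts[1:2]
rank_df <- data.frame(
  reason = names(top2),
  n      = as.integer(top2),
  stringsAsFactors = FALSE
)

print(rank_df)
```

Building the data frame with explicit column names avoids depending on the default `Var1` and `Freq` names.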

Visualization 9

This box plot shows the distribution of incident years for each encounter outcome.

The median incident year is 2009 for every outcome, and Arrest shows a wider spread of years than the other outcomes.

# The title belongs in labs(), not aes()
ggplot(ccrb, aes(x = `Encounter Outcome`, y = `Incident Year`)) + geom_boxplot(fill = "pink", color = "green3") + labs(title = "Encounter Outcomes by Year") + scale_x_discrete(name = "Encounter Outcome") + scale_y_continuous(name = "Incident Year")

Visualization 10

After seeing the distribution of different outcomes across years, we'd like to see the breakdown of all complaints by outcome.

The pie chart shows that Arrest and No Arrest/Summons account for most complaints, leaving very few with the other two outcomes.

# A single stacked bar wrapped into polar coordinates gives a pie chart
ggplot(ccrb, aes(x = factor(1), fill = `Encounter Outcome`)) + geom_bar() + labs(title = "Number of Complaints Filed by Encounter Outcomes") + theme(legend.position = "bottom") + scale_fill_discrete(name = "Encounter Outcome") + coord_polar(theta = "y")

ANLY 512 - Problem Set 4

Exploratory Data Analysis

Ting Tu

2017-10-09