Objectives

The objective of this assignment is to conduct an exploratory data analysis of a data set that you are not familiar with. In this week's lecture we discussed a number of visualization approaches to exploring a data set; this assignment will apply those tools and techniques. An important distinction between class examples and applied data science work is the iterative and repetitive nature of exploring a data set. It takes time to understand what is in the data and what is interesting in the data.

This week we will be exploring data from the NYC Data Transparency Initiative. They maintain a database of complaints that fall within the jurisdiction of the Civilian Complaint Review Board (CCRB), an independent municipal agency. Your objective is to identify interesting patterns and trends within the data that may be indicative of large-scale trends.

This link will allow you to download the data set in .xlsx format. The data file has two tabs: one with metadata, and the “Complaints_Allegations” tab with the actual data.

Deliverable and Grades

For this assignment you should submit an RPubs link to a knitr-rendered HTML document that shows your exploratory data analysis. Organize your analysis using meaningful section headings:

# This is a top section

## This is a subsection

Looking into the Dataset

ccrb <- read.csv("/Users/rohitmishra/Desktop/Harrisburg University/ANLY 512/ccrb.csv")
head(ccrb)

##    DateStamp UniqueComplaintId Close.Year Received.Year
## 1 11/29/2016                11       2006          2005
## 2 11/29/2016                18       2006          2004
## 3 11/29/2016                18       2006          2004
## 4 11/29/2016                18       2006          2004
## 5 11/29/2016                18       2006          2004
## 6 11/29/2016                18       2006          2004
##   Borough.of.Occurrence Is.Full.Investigation Complaint.Has.Video.Evidence
## 1             Manhattan                 FALSE                        FALSE
## 2              Brooklyn                  TRUE                        FALSE
## 3              Brooklyn                  TRUE                        FALSE
## 4              Brooklyn                  TRUE                        FALSE
## 5              Brooklyn                  TRUE                        FALSE
## 6              Brooklyn                  TRUE                        FALSE
##   Complaint.Filed.Mode Complaint.Filed.Place
## 1      On-line website                  CCRB
## 2                Phone                  CCRB
## 3                Phone                  CCRB
## 4                Phone                  CCRB
## 5                Phone                  CCRB
## 6                Phone                  CCRB
##   Complaint.Contains.Stop...Frisk.Allegations Incident.Location
## 1                                       FALSE    Street/highway
## 2                                       FALSE    Street/highway
## 3                                       FALSE    Street/highway
## 4                                       FALSE    Street/highway
## 5                                       FALSE    Street/highway
## 6                                       FALSE    Street/highway
##   Incident.Year    Encounter.Outcome
## 1          2005 No Arrest or Summons
## 2          2004               Arrest
## 3          2004               Arrest
## 4          2004               Arrest
## 5          2004               Arrest
## 6          2004               Arrest
##                     Reason.For.Initial.Contact Allegation.FADO.Type
## 1                                        Other   Abuse of Authority
## 2 PD suspected C/V of violation/crime - street   Abuse of Authority
## 3 PD suspected C/V of violation/crime - street          Discourtesy
## 4 PD suspected C/V of violation/crime - street          Discourtesy
## 5 PD suspected C/V of violation/crime - street          Discourtesy
## 6 PD suspected C/V of violation/crime - street                Force
##                Allegation.Description
## 1                    Threat of arrest
## 2 Refusal to obtain medical treatment
## 3                                Word
## 4                                Word
## 5                                Word
## 6                      Physical force

ccrb <- unique(ccrb)

Visualization 1:

Among the boroughs, Brooklyn has the largest number of incidents filed with the CCRB.

library(tidyverse)

## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr

## Warning: package 'tidyr' was built under R version 3.4.1

## Warning: package 'purrr' was built under R version 3.4.1

## Warning: package 'dplyr' was built under R version 3.4.1

## Conflicts with tidy packages ----------------------------------------------

## filter(): dplyr, stats
## lag():    dplyr, stats

library(ggplot2)
library(forcats)
library(dplyr)
ggplot(ccrb, aes(x = fct_infreq(Borough.of.Occurrence))) +
  geom_bar() + xlab("Borough.of.Occurrence")

Visualization 2:

The number of complaints has trended downward in recent years. It peaked around 2005-2006 and has been decreasing since.

library(ggplot2)
library(ggthemes)
library(readxl)
library(dplyr)
# Count distinct complaints (not allegation rows) per received year
df.by.receiveyear <- ccrb %>%
  group_by(Received.Year) %>%
  summarize(num_case = n_distinct(UniqueComplaintId))

ggplot(data = df.by.receiveyear, aes(x = Received.Year, y = num_case)) + 
  geom_line(alpha = 0.5) + 
  ggtitle('Figure 1: Number of Complaints by Received Year') + 
  xlab('Received Year') + 
  ylab('Number of Cases') + 
  theme_economist()
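The reason for counting distinct complaint IDs rather than rows can be shown with a minimal sketch. The toy data frame below is made up for illustration (only the column names come from the data set), and it uses base R so it runs without extra packages. Because each row is one allegation, a complaint with several allegations would be counted several times if we simply counted rows.

```r
# Toy data (made up for illustration): one row per allegation,
# so a complaint with several allegations appears several times.
toy <- data.frame(
  UniqueComplaintId = c(11, 18, 18, 18, 25, 25),
  Received.Year     = c(2005, 2004, 2004, 2004, 2005, 2005)
)

# Rows per year: this counts allegations, not complaints.
rows_per_year <- table(toy$Received.Year)

# Distinct complaints per year: drop repeated IDs before counting.
complaints_per_year <- tapply(
  toy$UniqueComplaintId, toy$Received.Year,
  function(ids) length(unique(ids))
)

print(rows_per_year)
print(complaints_per_year)
```

In this toy example 2004 has three rows but only one complaint, which is the gap that `n_distinct(UniqueComplaintId)` closes in the analysis above.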

Visualization 3:

Looking at how complaints are filed, most people register their complaints by phone.

ccrb %>%
  mutate(Complaint.Filed.Mode = Complaint.Filed.Mode %>% fct_infreq() %>% fct_rev()) %>%
  ggplot(aes(Complaint.Filed.Mode)) +
  geom_bar() + coord_flip() + xlab("Complaint.Filed.Mode")

Visualization 4:

We see that most encounters end without an arrest but lead to a summons.

ggplot(ccrb, aes(x = fct_infreq(Encounter.Outcome))) +
  geom_bar() + xlab("Encounter.Outcome")

Visualization 5:

Complaint processing does not take long. Most cases are closed within a year, or within one to two years, of being received.

df.dif <- ccrb %>% 
                distinct(UniqueComplaintId, .keep_all = TRUE) %>%
                  mutate(time_length = Close.Year - Received.Year)

ggplot(data = df.dif, aes(x = time_length)) + 
  geom_bar(width = 0.5, alpha = 0.5) + 
  labs(title = 'Figure 3: Time Length for Complaints to Be Processed', x = 'Time Length (Years)') +
  theme_economist()
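To put numbers behind the bar chart, the share of complaints at each processing length can be tabulated directly. The sketch below uses a made-up toy data frame (only the column names come from the data set) and base R. Note that the difference of two calendar years is a coarse measure: a complaint received in December and closed the following January still counts as one year.

```r
# Toy data (made up for illustration), one row per allegation.
toy <- data.frame(
  UniqueComplaintId = c(11, 18, 18, 25, 31),
  Received.Year     = c(2005, 2004, 2004, 2005, 2006),
  Close.Year        = c(2006, 2006, 2006, 2005, 2007)
)

# Keep one row per complaint so multi-allegation complaints count once.
per_complaint <- toy[!duplicated(toy$UniqueComplaintId), ]

# Processing length in whole calendar years.
per_complaint$time_length <- per_complaint$Close.Year - per_complaint$Received.Year

# Share of complaints at each processing length.
share_by_length <- prop.table(table(per_complaint$time_length))
print(share_by_length)
```

Applied to `df.dif`, the same `prop.table(table(...))` call would give the proportion of complaints closed in the same year, after one year, and so on.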

Visualization 6:

Incident Outcome: The majority of cases that ended in an arrest went through a full investigation. However, there are still a large number of cases where an arrest was made, or no arrest or summons was issued, without a full investigation.

ggplot(data = ccrb , aes(x = Encounter.Outcome, fill = Is.Full.Investigation)) + 
  geom_bar(stat = 'count', alpha = 0.5) + 
  labs(title='Figure 10: Incident Outcome and Full Investigation Relationship') + 
  scale_fill_discrete(name = 'Full Investigation or Not') +
  theme_economist()

Visualization 7:

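# Count records per reason for initial contact, most frequent first
reason_counts <- data.frame(sort(table(ccrb$Reason.For.Initial.Contact), decreasing = TRUE))
# Plot the ten most frequent reasons (Var1 = reason, Freq = count)
ggplot(reason_counts[1:10, ], aes(Var1, Freq)) + geom_point() + coord_flip()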

Visualization 8:

When each allegation type is split by whether a full investigation was done, the proportions of TRUE and FALSE are quite close.

ggplot(data = ccrb, aes(x = Allegation.FADO.Type, fill = Is.Full.Investigation)) + 
  geom_bar(stat = 'count') + 
  labs(title='Allegation Type vs. Full Investigation Relationship') + 
  scale_fill_discrete(name = 'Full Investigation or Not') + geom_text(stat='count',aes(label=..count..))
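Stacked counts make proportions hard to compare across bars of different heights, so it can help to compute the within-type shares explicitly. The sketch below uses a made-up toy data frame (the column names and category labels come from the data set, the values do not) and base R.

```r
# Toy data (made up for illustration).
toy <- data.frame(
  Allegation.FADO.Type = c(
    "Force", "Force", "Force", "Force",
    "Discourtesy", "Discourtesy",
    "Abuse of Authority", "Abuse of Authority",
    "Abuse of Authority", "Abuse of Authority"
  ),
  Is.Full.Investigation = c(
    TRUE, TRUE, FALSE, FALSE,
    TRUE, FALSE,
    TRUE, FALSE, FALSE, FALSE
  )
)

# Cross-tabulate allegation type against full-investigation status.
counts <- table(toy$Allegation.FADO.Type, toy$Is.Full.Investigation)

# margin = 1 normalizes each row, giving the share of TRUE/FALSE per type.
row_shares <- prop.table(counts, margin = 1)
print(round(row_shares, 2))
```

In ggplot2, `geom_bar(position = "fill")` draws these same within-type shares as bars of equal height.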

Visualization 9:

Incident Location: When incident locations are arranged in descending order of frequency, we see that the most common location for incidents is the street/highway.

ggplot(ccrb, aes(x = fct_infreq(Incident.Location))) +
  geom_bar() + xlab("Incident.Location") + coord_flip()

Visualization 10:

library(vcd)

## Loading required package: grid

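# Drop rows with missing values, then keep the two variables of interest
ccrb1 <- na.omit(ccrb)
x <- ccrb1[, c("Encounter.Outcome", "Allegation.FADO.Type")]
# Association plot of allegation type against encounter outcome, with residual-based shading
assoc(~ Allegation.FADO.Type + Encounter.Outcome, data = x, shade = TRUE)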
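An association plot displays how far each cell of a two-way table departs from what independence would predict, measured as a Pearson residual. As a rough sketch of that quantity, the example below computes the residuals by hand in base R for a made-up 2x2 table (the category labels come from the data set, the counts do not).

```r
# Toy contingency table (counts made up for illustration).
observed <- matrix(
  c(30, 10,
    10, 30),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    Allegation.FADO.Type = c("Force", "Discourtesy"),
    Encounter.Outcome    = c("Arrest", "No Arrest or Summons")
  )
)

# Expected counts if allegation type and outcome were independent:
# (row total * column total) / grand total.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson residual per cell: positive means more cases than expected,
# negative means fewer.
pearson_resid <- (observed - expected) / sqrt(expected)
print(round(pearson_resid, 2))
```

Cells with large positive or negative residuals are the ones that stand out in the plot; cells near zero are consistent with independence.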

Summary:

Exploratory Data Analysis (EDA) is of great importance for summarizing the basic relationships among variables. It helps identify which questions are already answered and which still need to be dug into. For instance, in this dataset, EDA shows that the response time is quite efficient, because the time between a case being received and being closed is short. However, the investigation visualizations imply that there is room to analyze why the investigation rate is low.

ANLY 512 - Problem Set 5

Exploratory Data Analysis

Rohit Mishra

2017-09-17