ANLY 512 - Problem Set 5

Objectives

The objective of this assignment is to conduct an exploratory data analysis of a data set that you are not familiar with. In this week's lecture we discussed a number of visualization approaches to exploring a data set; this assignment will apply those tools and techniques. An important distinction between class examples and applied data science work is the iterative and repetitive nature of exploring a data set. It takes time to understand what is in the data and what is interesting in the data.

This week we will be exploring data from the NYC Data Transparency Initiative. They maintain a database of complaints that fall under the Civilian Complaint Review Board (CCRB), an independent municipal agency. Your objective is to identify interesting patterns and trends within the data that may be indicative of large-scale trends.

This link will allow you to download the data set in .xlsx format. The data file has two tabs: one with metadata, and the “Complaints_Allegations” tab with the actual data.

Deliverable and Grades

For this assignment you should submit an RPubs link to a knitr-rendered HTML document that shows your exploratory data analysis. Organize your analysis using meaningful section headings.

Your final document should include at least 10 visualizations. Each should include a brief statement of what it shows.

A final section should summarize what you learned from your EDA. Your grade will be based on the quality of your graphics and the sophistication of your findings.

library(ggplot2)

library(ggthemes)

library(readxl)

library(dplyr)

Load Dataset

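# Read the data tab; the workbook's other tab holds only metadata
df <- read_excel("D:/Harrisburg/512-50/ccrb_datatransparencyinitiative.xlsx", sheet = "Complaints_Allegations")
# data.frame() converts the tibble and makes column names syntactic (check.names = TRUE)
df <- data.frame(df)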
str(df)

## 'data.frame':    204397 obs. of  16 variables:
##  $ DateStamp                                  : POSIXct, format: "2016-11-29" "2016-11-29" ...
##  $ UniqueComplaintId                          : num  11 18 18 18 18 18 18 18 18 18 ...
##  $ Close.Year                                 : num  2006 2006 2006 2006 2006 ...
##  $ Received.Year                              : num  2005 2004 2004 2004 2004 ...
##  $ Borough.of.Occurrence                      : chr  "Manhattan" "Brooklyn" "Brooklyn" "Brooklyn" ...
##  $ Is.Full.Investigation                      : logi  FALSE TRUE TRUE TRUE TRUE TRUE ...
##  $ Complaint.Has.Video.Evidence               : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Complaint.Filed.Mode                       : chr  "On-line website" "Phone" "Phone" "Phone" ...
##  $ Complaint.Filed.Place                      : chr  "CCRB" "CCRB" "CCRB" "CCRB" ...
##  $ Complaint.Contains.Stop...Frisk.Allegations: logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Incident.Location                          : chr  "Street/highway" "Street/highway" "Street/highway" "Street/highway" ...
##  $ Incident.Year                              : num  2005 2004 2004 2004 2004 ...
##  $ Encounter.Outcome                          : chr  "No Arrest or Summons" "Arrest" "Arrest" "Arrest" ...
##  $ Reason.For.Initial.Contact                 : chr  "Other" "PD suspected C/V of violation/crime - street" "PD suspected C/V of violation/crime - street" "PD suspected C/V of violation/crime - street" ...
##  $ Allegation.FADO.Type                       : chr  "Abuse of Authority" "Abuse of Authority" "Discourtesy" "Discourtesy" ...
##  $ Allegation.Description                     : chr  "Threat of arrest" "Refusal to obtain medical treatment" "Word" "Word" ...

Visualization 1: Complaints by Received Year

# Count distinct complaints (not allegation rows) per received year
df.by.receiveyear <- df %>%
  group_by(Received.Year) %>%
  summarize(num_case = n_distinct(UniqueComplaintId))

ggplot(data = df.by.receiveyear, aes(x = Received.Year, y = num_case)) + 
  geom_line(alpha = 0.5) + 
  ggtitle('Figure 1: Number of Complaints by Received Year') + 
  xlab('Received Year') + 
  ylab('Number of Cases') + 
  theme_economist()

Figure 1 shows a decreasing trend in the number of complaints in recent years, after a peak around 2005-2006. This suggests that the complaint situation may be improving.
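The counts in Figure 1 use n_distinct() because the data has one row per allegation, so a complaint ID can repeat (ID 18 appears several times in the str() output above). The sketch below illustrates the difference between counting rows and counting distinct complaints; the data frame `toy` and its values are made up for illustration and are not taken from the CCRB file.

```r
library(dplyr)

# Hypothetical toy data: one row per allegation, so complaint IDs repeat
toy <- data.frame(
  UniqueComplaintId = c(11, 18, 18, 18, 25, 25, 31),
  Received.Year     = c(2005, 2004, 2004, 2004, 2005, 2005, 2006)
)

toy.by.year <- toy %>%
  group_by(Received.Year) %>%
  summarize(
    num_rows = n(),                           # allegation rows per year
    num_case = n_distinct(UniqueComplaintId)  # distinct complaints per year
  )

print(toy.by.year)
```

In this toy example the year 2004 has three rows but only one complaint, which is why a plain row count would overstate the number of complaints.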

Visualization 2: Complaints by Close Year

# Count distinct complaints (not allegation rows) per close year
df.by.closeyear <- df %>%
  group_by(Close.Year) %>%
  summarize(num_case = n_distinct(UniqueComplaintId))

ggplot(data = df.by.closeyear, aes(x = Close.Year, y = num_case)) + 
  geom_line(alpha = 0.5) + 
  ggtitle('Figure 2: Number of Complaints by Close Year') + 
  xlab('Close Year') + 
  ylab('Number of Cases') + 
  theme_economist()

Figure 2 shows that the number of complaints by close year follows a general downward trend, although the counts have fluctuated in recent years.

Visualization 3: Time Taken to Process Complaints

# Keep one row per complaint, then compute years from receipt to closure
df.dif <- df %>%
  distinct(UniqueComplaintId, .keep_all = TRUE) %>%
  mutate(time_length = Close.Year - Received.Year)

ggplot(data = df.dif, aes(x = time_length)) + 
  geom_bar(width = 0.5, alpha = 0.5) + 
  labs(title = 'Figure 3: Time Length for Complaints to Be Processed', x = 'Time Length (Years)', y = 'Number of Complaints') +
  theme_economist()

Figure 3 shows that complaints are processed fairly quickly. The majority are closed in the same calendar year they were received or 1-2 years later.
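To put numbers behind "the majority", the year differences can be tabulated as shares. The sketch below is a minimal illustration on a made-up data frame (`toy` and its values are hypothetical). Keep in mind that differences of calendar years are coarse: a complaint received in December and closed the following January still counts as one year.

```r
library(dplyr)

# Hypothetical toy data: one row per complaint
toy <- data.frame(
  UniqueComplaintId = 1:6,
  Received.Year     = c(2004, 2004, 2005, 2005, 2006, 2006),
  Close.Year        = c(2004, 2005, 2005, 2006, 2006, 2008)
)

toy.share <- toy %>%
  mutate(time_length = Close.Year - Received.Year) %>%  # years to close
  count(time_length) %>%                                # complaints per gap
  mutate(share = n / sum(n))                            # fraction of all complaints

print(toy.share)
```

The same pipeline could be applied to df.dif by dropping the first mutate() step, since time_length is already computed there.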

Visualization 4: Time Between Incident Year and Received Year

# Keep one row per complaint, then compute years from incident to receipt
df.dif2 <- df %>%
  distinct(UniqueComplaintId, .keep_all = TRUE) %>%
  mutate(time_length = Received.Year - Incident.Year) %>%
  select(time_length)

ggplot(data = df.dif2, aes(x = time_length)) + 
  geom_bar(width = 0.5, alpha = 0.5) + 
  labs(title = 'Figure 4: Time Length for Complaints to Be Received', x = 'Time Length (Years)', y = 'Number of Complaints') +
  theme_economist()

Figure 4 shows that the majority of complaints are received within a year of the incident. It is a good sign that people tend to report incidents right away.

Visualization 5: Time Between Incident Year and Close Year

# Keep one row per complaint, then compute years from incident to closure
df.dif3 <- df %>%
  distinct(UniqueComplaintId, .keep_all = TRUE) %>%
  mutate(time_length = Close.Year - Incident.Year) %>%
  select(time_length)

ggplot(data = df.dif3, aes(x = time_length)) + 
  geom_bar(width = 0.5, alpha = 0.5) + 
  labs(title = 'Figure 5: Time Length for Complaints to Be Closed from Incident Year', x = 'Time Length (Years)') +
  theme_economist()

Figure 5 shows that the majority of complaints are also closed in a timely manner relative to the incident year.

Visualization 6: Complaints by Borough and Investigation Status

ggplot(data = df, aes(x = Borough.of.Occurrence, fill = Is.Full.Investigation )) + 
  geom_bar(width = 0.5, alpha = 0.5, stat = 'count') + 
  labs(title = 'Figure 6: Complaints by Borough and Investigation Status', x = 'Borough') +
  scale_fill_discrete(name = 'Full Investigation or Not') +
  theme_economist()

Figure 6 shows that the split between full and non-full investigations does not differ much across boroughs, which is a good sign. However, in every borough many complaints did not trigger a full investigation, so there may still be room for improvement. Note that df holds one row per allegation, so these bars count allegation rows rather than distinct complaints.
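Because the boroughs differ a lot in total volume, raw counts make it hard to compare investigation shares by eye. One option is to normalize each bar to 100% with position = "fill". The sketch below uses a made-up data frame (`toy` is hypothetical, with one row per complaint) rather than the CCRB data.

```r
library(ggplot2)

# Hypothetical toy data: one row per complaint
toy <- data.frame(
  Borough.of.Occurrence = c("Bronx", "Bronx", "Bronx",
                            "Brooklyn", "Brooklyn",
                            "Queens", "Queens", "Queens", "Queens"),
  Is.Full.Investigation = c(TRUE, FALSE, FALSE,
                            TRUE, TRUE,
                            FALSE, TRUE, FALSE, FALSE)
)

# position = "fill" rescales every bar to height 1, so bars show shares
p.share <- ggplot(toy, aes(x = Borough.of.Occurrence, fill = Is.Full.Investigation)) +
  geom_bar(position = "fill", width = 0.5, alpha = 0.5) +
  scale_y_continuous(labels = scales::percent) +
  labs(x = "Borough", y = "Share of Complaints") +
  scale_fill_discrete(name = "Full Investigation or Not")

print(p.share)
```

On the real data, applying distinct(UniqueComplaintId, .keep_all = TRUE) first would make the shares refer to complaints instead of allegation rows.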

Visualization 7: Summary of Investigation and Video Evidence Relationship

ggplot(data = df, aes(x = Is.Full.Investigation, fill = Complaint.Has.Video.Evidence)) + 
  geom_bar(stat = 'count', alpha = 0.5) + 
  labs(title = 'Figure 7: Investigation and Video Evidence Joint Distribution', x = 'Full Investigation or Not') + 
  scale_fill_discrete(name = 'Video Evidence or Not') + 
  theme_economist()

Figure 7 suggests that the response to complaints is not good. The majority of complaints do not receive a full investigation, regardless of whether video evidence is included, which signals room for improvement. However, the probability of a full investigation appears to be a little higher when video evidence is included, so complainants may benefit from submitting more evidence when filing.
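The claim that video evidence raises the chance of a full investigation is easier to check as a rate than from stacked counts. The sketch below computes that rate per group on a made-up data frame (`toy` and its values are hypothetical and do not reflect the CCRB numbers).

```r
library(dplyr)

# Hypothetical toy data: one row per complaint
toy <- data.frame(
  UniqueComplaintId            = 1:8,
  Complaint.Has.Video.Evidence = c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE),
  Is.Full.Investigation        = c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE)
)

toy.rate <- toy %>%
  group_by(Complaint.Has.Video.Evidence) %>%
  summarize(
    num_case  = n(),                         # complaints in the group
    full_rate = mean(Is.Full.Investigation)  # mean of a logical = share of TRUE
  )

print(toy.rate)
```

Taking the mean of a logical column works because R treats TRUE as 1 and FALSE as 0.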

Visualization 8: Summary of Complaint Method

ggplot(data = df, aes(x = Complaint.Filed.Mode)) + 
  geom_bar(stat = 'count') + 
  labs(title = 'Figure 8: Complaint Method Summary', x = 'Method Type') + 
  theme_economist()

Figure 8 shows that traditional methods such as fax or mail are not widely used today. Phone, call, and website filings are more prevalent.
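A bar chart of filing modes is easier to read when the bars are sorted by frequency instead of alphabetically. The sketch below does this by setting the factor levels in order of descending count; the data frame `toy` is made up for illustration.

```r
library(ggplot2)

# Hypothetical toy data: one row per complaint
toy <- data.frame(
  Complaint.Filed.Mode = c("Phone", "Phone", "Phone",
                           "On-line website", "On-line website",
                           "Mail")
)

# Order the factor levels from most to least frequent
mode.counts <- sort(table(toy$Complaint.Filed.Mode), decreasing = TRUE)
toy$Complaint.Filed.Mode <- factor(toy$Complaint.Filed.Mode,
                                   levels = names(mode.counts))

p.mode <- ggplot(toy, aes(x = Complaint.Filed.Mode)) +
  geom_bar(alpha = 0.5) +
  labs(x = "Method Type", y = "Number of Complaints")

print(p.mode)
```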

Visualization 9: Incident Outcome

ggplot(data = df, aes(x = Encounter.Outcome)) + 
  geom_bar(stat = 'count', alpha = 0.5) + 
  labs(title='Figure 9: Incident Outcome Summary') + 
  theme_economist()

Figure 9 shows that arrest is one of the most common outcomes. However, there are still many cases that end with no arrest or summons. Further analysis below examines the relationship between the encounter outcome and the investigation result.

Visualization 10: Incident Outcome and Investigation Result

ggplot(data = df, aes(x = Encounter.Outcome, fill = Is.Full.Investigation)) + 
  geom_bar(stat = 'count', alpha = 0.5) + 
  labs(title='Figure 10: Incident Outcome and Full Investigation Relationship') + 
  scale_fill_discrete(name = 'Full Investigation or Not') +
  theme_economist()

Figure 10 shows that the majority of cases with an arrest outcome went through a full investigation. However, there are still a large number of cases where an arrest was made without a full investigation, or where no arrest or summons was made without a full investigation. There is still room for improvement.
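The relationship between outcome and investigation status can also be read off a cross-tabulation with row shares. The sketch below uses base R's table() and prop.table() on a made-up data frame (`toy` and its values are hypothetical).

```r
# Hypothetical toy data: one row per complaint
toy <- data.frame(
  Encounter.Outcome = c("Arrest", "Arrest", "Arrest", "Arrest",
                        "No Arrest or Summons", "No Arrest or Summons",
                        "Summons", "Summons"),
  Is.Full.Investigation = c(TRUE, TRUE, TRUE, FALSE,
                            TRUE, FALSE,
                            FALSE, FALSE)
)

# Rows are outcomes, columns are investigation status (FALSE / TRUE)
tab <- table(toy$Encounter.Outcome, toy$Is.Full.Investigation)

# margin = 1 normalizes each row, giving the share fully investigated per outcome
row.share <- prop.table(tab, margin = 1)

print(round(row.share, 2))
```

Row shares answer "of the cases with this outcome, what fraction was fully investigated", which is the comparison the stacked bars in Figure 10 only show indirectly.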

Summary

Exploratory data analysis (EDA) is of great importance for summarizing the basic relationships among variables. It is helpful because it reveals which questions are already answered and which still need to be dug into. For instance, in this data set, EDA shows that the response time is quite efficient, because the waiting time for cases to be received or closed is low. However, the investigation visualizations imply that there is room to analyze why the full-investigation rate is low.

Moreover, data visualization is not a standalone task; it should be combined with data collection, data manipulation, and data cleansing. For instance, the variables can be visualized better after group-by data manipulation. Hence, to present the relationships among variables well, we need skills not only in data visualization but also in other areas such as data manipulation.