Objectives

The objective of this assignment is to conduct an exploratory data analysis of a data set that you are not familiar with. In this week's lecture we discussed a number of visualization approaches to exploring a data set; this assignment will apply those tools and techniques. An important distinction between class examples and applied data science work is the iterative and repetitive nature of exploring a data set. It takes time to understand what is in the data and what is interesting in it.

For this week we will be exploring data from the NYC Data Transparency Initiative. They maintain a database of complaints that fall within the jurisdiction of the Civilian Complaint Review Board (CCRB), an independent municipal agency. Your objective is to identify interesting patterns and trends within the data that may be indicative of large-scale trends.

This link will allow you to download the data set in .xlsx format. The data file has two tabs: one with metadata, and the “Complaints_Allegations” tab with the actual data.

Section 1

library(ggplot2)
library(ggthemes)

retrieve data from file

data=read.csv('/Users/Katherine 1/Desktop/ANLY 512 Data Visualization/Problem set 4/Complaints_Allegations.csv')

1. Number of Complaints by Years

ggplot(data, aes(x=Incident.Year, fill=Incident.Year)) +
  # geom_bar() counts rows per x value; geom_histogram(stat = "count") triggered an "unknown parameters" warning
  geom_bar() +
  labs(title="Number of Complaints by Years", x="Years", y="Number of Complaints")

When I first looked at the data, the first thing I wanted to know was how the number of complaints changed over the years and whether it increased or decreased. According to the graph above, the number of complaints increased until 2007 and has been decreasing since then.
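The trend read off the bar chart can also be checked numerically. The following is a minimal sketch in base R, assuming a column named Incident.Year as in the plots; the `toy` data frame and its values are invented for illustration and are not the CCRB data.

```r
# Toy stand-in for the complaints data; the values are invented for illustration.
toy <- data.frame(
  Incident.Year = c(2005, 2006, 2006, 2007, 2007, 2007, 2008, 2008, 2009)
)

# Count rows per incident year (the same tally the bar chart draws).
year_counts <- table(toy$Incident.Year)

# Year with the most rows, and the change in count from each year to the next.
peak_year <- as.integer(names(year_counts)[which.max(year_counts)])
yearly_change <- diff(as.vector(year_counts))

print(year_counts)
print(peak_year)
print(yearly_change)
```

Positive values in yearly_change mark years where the count rose and negative values mark years where it fell, so a peak shows up as the point where the sign flips.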

2. Incident Years VS. Received Years

ggplot(data, aes(x=Incident.Year, y=Received.Year)) + 
  geom_point() + 
  geom_smooth(method = lm) +
  labs(title="Incident Years VS. Received Years", x="Incident Years", y="Received Years")

The plot above shows that an incident may not be received in the same year it happened. Some incidents were received years after they occurred.
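The reporting delay can be measured directly instead of being read from the scatter plot. This is a rough sketch in base R, assuming the columns Incident.Year and Received.Year used above; the `toy` rows are made up for illustration.

```r
# Toy stand-in for the complaints data; the values are invented for illustration.
toy <- data.frame(
  Incident.Year = c(2005, 2005, 2006, 2007, 2007),
  Received.Year = c(2005, 2006, 2006, 2007, 2009)
)

# Delay, in calendar years, between the incident and receipt of the complaint.
report_lag <- toy$Received.Year - toy$Incident.Year

# How often each delay occurs, and the share of rows received in the incident year.
lag_table <- table(report_lag)
share_same_year <- mean(report_lag == 0)

print(lag_table)
print(share_same_year)
```

Because many points in the scatter plot overlap, a table of delays like this makes it easier to see how common the late reports actually are.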

3. Received Years VS. Close Years

ggplot(data, aes(x=Received.Year, y=Close.Year)) + 
  geom_point() + 
  geom_smooth(method = lm) +
  labs(title="Received Years VS. Close Years", x="Received Years", y="Close Years")

This graph shows that, most of the time, an incident took years to close. But over time, the difference between received years and close years gets
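How the time to close changes across years can be summarized with a per-year average. The sketch below uses base R and assumes the columns Received.Year and Close.Year from the plot; the `toy` values and the helper column Years.To.Close are hypothetical, so the numbers say nothing about the real trend.

```r
# Toy stand-in for the complaints data; the values are invented for illustration.
toy <- data.frame(
  Received.Year = c(2005, 2005, 2006, 2006, 2007, 2007),
  Close.Year    = c(2007, 2008, 2007, 2008, 2008, 2008)
)

# Calendar years between receiving a complaint and closing it (hypothetical helper column).
toy$Years.To.Close <- toy$Close.Year - toy$Received.Year

# Average time to close, grouped by the year the complaint was received.
mean_close_time <- tapply(toy$Years.To.Close, toy$Received.Year, mean)

print(mean_close_time)
```

Recent years need some care in a summary like this, since complaints that are still open have no close year yet.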

4. Number of Complaints by Borough and Type

ggplot(data, aes(x=Borough.of.Occurrence, fill=Allegation.FADO.Type)) +
  # geom_bar() stacks the per-type counts within each borough
  geom_bar() +
  labs(title="Number of Complaints by Borough and Type", x="Borough of Occurrence", y="Number of Complaints") +
  scale_fill_discrete(name="Allegation Type")

This chart shows that Brooklyn has the most complaints. Among the allegation types, abuse of authority accounts for the majority of complaints, and force is second.
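The ranking of boroughs and the share of each allegation type can be confirmed with a cross-tabulation. This is a minimal base R sketch, assuming the columns Borough.of.Occurrence and Allegation.FADO.Type from the plot; the `toy` rows are invented.

```r
# Toy stand-in for the complaints data; the values are invented for illustration.
toy <- data.frame(
  Borough.of.Occurrence = c("Brooklyn", "Brooklyn", "Brooklyn", "Bronx", "Bronx", "Queens"),
  Allegation.FADO.Type  = c("Abuse of Authority", "Force", "Abuse of Authority",
                            "Force", "Abuse of Authority", "Discourtesy")
)

# Rows are boroughs, columns are allegation types.
counts <- table(toy$Borough.of.Occurrence, toy$Allegation.FADO.Type)

# Total per borough, largest first, and each type's share of all rows.
borough_totals <- sort(rowSums(counts), decreasing = TRUE)
type_share <- prop.table(colSums(counts))

print(counts)
print(borough_totals)
print(round(type_share, 2))
```

A type only holds a majority when its share exceeds 0.5, which is easier to judge from type_share than from stacked bar heights.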

5. Number of Complaints by Incident Locations

ggplot(data, aes(x=Incident.Location, fill=Incident.Location)) +
  geom_bar() +
  labs(title="Number of Complaints by Incident Locations", x="Locations", y="Number") +
  scale_fill_discrete(name="Incident Location") +
  theme(legend.position = "bottom")

This chart shows directly which locations have the most incidents or complaints. Street/highway has far more complaints than any other location, followed by apartment/house.
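With many location categories, the ranking is easier to read when the categories are ordered by frequency. The following sketch, in base R, assumes the column Incident.Location; the `toy` rows are invented and the category "Subway station" is a hypothetical label.

```r
# Toy stand-in for the complaints data; the values are invented for illustration.
toy <- data.frame(
  Incident.Location = c("Street/highway", "Apartment/house", "Street/highway",
                        "Subway station", "Street/highway", "Apartment/house")
)

# Count rows per location and sort from most to least frequent.
location_counts <- sort(table(toy$Incident.Location), decreasing = TRUE)

# Store the column as a factor whose levels follow that order.
toy$Incident.Location <- factor(toy$Incident.Location, levels = names(location_counts))

print(location_counts)
print(levels(toy$Incident.Location))
```

ggplot2 draws a discrete axis in factor-level order, so passing the reordered column to the same bar chart would place the most frequent location first.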

6. Number of Complaints by Boroughs and Incident Locations

ggplot(data, aes(x=Borough.of.Occurrence, fill=Incident.Location)) +
  # geom_bar() stacks the per-location counts within each borough
  geom_bar() +
  labs(title="Number of Complaints by Location", x="Borough of Occurrence", y="Number of Complaints") +
  scale_fill_discrete(name="Incident Locations")

As the graph above shows, street/highway is where most incidents occurred, and apartment/house is second. Brooklyn again has the most complaints of any borough.

7. Number of Complaints by Filed Mode

ggplot(data, aes(x=Complaint.Filed.Mode, fill=Complaint.Filed.Mode)) + 
  geom_bar(stat = "count") +
  labs(title="Number of Complaints by Filed Mode", x="Complaints Filed Mode", y="Number of Complaints")

This graph shows that most complaints are filed by phone, followed by the call processing system. Fax is the least used mode, and this may have some evidence involved.

8. Number of Complaints Filed Mode over Years

ggplot(data, aes(x=Incident.Year, fill=Complaint.Filed.Mode)) +
  # geom_bar() stacks the per-mode counts within each year
  geom_bar() +
  labs(title="Number of Complaints Filed Mode over Years", x="Incident Year", y="Number of Complaints Filed Mode") +
  scale_fill_discrete(name="Complaints Filed Mode")

Phone has been the most common way to file complaints over the years, followed by the call processing system. As technology and online systems have developed, the on-line website has been used more for filing complaints.
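Since the total number of complaints changes from year to year, the shift between filing modes is clearer as a share within each year than as a raw count. This is a small base R sketch, assuming the columns Incident.Year and Complaint.Filed.Mode; the `toy` rows are invented for illustration.

```r
# Toy stand-in for the complaints data; the values are invented for illustration.
toy <- data.frame(
  Incident.Year = c(2005, 2005, 2005, 2005, 2009, 2009, 2009, 2009),
  Complaint.Filed.Mode = c("Phone", "Phone", "Phone", "On-line website",
                           "Phone", "Phone", "On-line website", "On-line website")
)

# Rows are years, columns are filing modes.
mode_by_year <- table(toy$Incident.Year, toy$Complaint.Filed.Mode)

# margin = 1 divides each row by its total, giving each mode's share within a year.
mode_share <- prop.table(mode_by_year, margin = 1)

print(round(mode_share, 2))
```

In a plot, the same view comes from geom_bar(position = "fill"), which rescales every bar to the same height.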

9. Complaints Filed Mode by Boroughs

ggplot(data, aes(x=Borough.of.Occurrence, fill=Complaint.Filed.Mode)) +
  # geom_bar() stacks the per-mode counts within each borough
  geom_bar() +
  labs(title="Number of Complaints Filed Mode by Borough", x="Borough of Occurrence", y="Number of Complaints Filed Mode") +
  scale_fill_discrete(name="Complaints Filed Mode")

Compared to other boroughs, Brooklyn has the highest number of complaints filed by phone and through the call processing system.

10. Incident Locations VS. Encounter Outcome

ggplot(data, aes(x=Incident.Location, fill=Encounter.Outcome)) +
  # geom_bar() stacks the per-outcome counts within each location; the y axis is a count
  geom_bar() +
  labs(title="Incident Locations VS. Encounter Outcome", x="Incident Locations", y="Number of Complaints") +
  theme(legend.position = "bottom") +
  scale_fill_discrete(name="Encounter Outcome")

The majority of the incidents or complaints ended in arrest. Those shown in green, which stand for no arrest or summons, may have ended with warnings. Almost all the incidents that happened on a street/highway ended with a summons.
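Claims about which outcome dominates at each location can be checked with row-wise proportions. The sketch below uses base R and assumes the columns Incident.Location and Encounter.Outcome; the `toy` rows and outcome labels are invented, so the output does not reflect the real data.

```r
# Toy stand-in for the complaints data; the values are invented for illustration.
toy <- data.frame(
  Incident.Location = c("Street/highway", "Street/highway", "Street/highway",
                        "Street/highway", "Apartment/house", "Apartment/house"),
  Encounter.Outcome = c("Arrest", "Arrest", "Summons",
                        "No Arrest or Summons", "Arrest", "No Arrest or Summons")
)

# Rows are locations, columns are encounter outcomes.
outcome_counts <- table(toy$Incident.Location, toy$Encounter.Outcome)

# margin = 1 gives each outcome's share within a location.
outcome_share <- prop.table(outcome_counts, margin = 1)

print(outcome_counts)
print(round(outcome_share, 2))
```

Reading across a row shows how the outcomes split for one location; reading down a column of outcome_counts shows where a given outcome occurred most often. The two questions have different answers in general, so it helps to look at both.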

Section 2: Conclusion

This exercise gave me a deeper understanding of how data visualization provides a clear and direct way to show information and compare data across groups or categories. It can reveal relationships between data from different groups. When receiving a data set, some questions come to mind. For example, which borough had the most incidents? Answering that with a manual count is not practical. With this question in mind, visual graphs not only give the estimated numbers but also allow comparisons among different categories. This is also the definition of Exploratory Data Analysis (EDA): the iterative process by which an analyst gains a quantitative and qualitative understanding of a data set through asking and answering questions.

ANLY 512 - Problem Set 4

Exploratory Data Analysis_Complaints Allegations

Shan Huang

07/04/2017