The objective of this assignment is to conduct an exploratory data analysis of a data set that you are not familiar with. In this week's lecture we discussed a number of visualization approaches to exploring a data set; this assignment will apply those tools and techniques. An important distinction between class examples and applied data science work is the iterative and repetitive nature of exploring a data set. It takes time to understand what is in the data and what is interesting in the data.
For this week we will be exploring data from the NYC Data Transparency Initiative. They maintain a database of complaints that fall within the Civilian Complaint Review Board (CCRB), an independent municipal agency. Your objective is to identify interesting patterns and trends within the data that may be indicative of large-scale trends.
This link will allow you to download the data set in .xlsx format. The data file has two tabs: one with metadata, and the “Complaints_Allegations” tab with the actual data.
For this assignment you should submit an RPubs link to a knitr-rendered HTML document that shows your exploratory data analysis. Organize your analysis using meaningful section headings:
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Since there can be multiple allegations associated with a single complaint, the first step is to reorganize the dataset to keep only one row per unique complaint, so that the count of complaints doesn’t include the same complaint multiple times.
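The de-duplication step described above can be sketched as follows. This is a minimal illustration in plain Python rather than the R code used for the original analysis. The column names (`UniqueComplaintId`, `Received Year`, `Allegation FADO Type`) and the sample rows are assumptions made for the example, not values taken from the data file.

```python
# Hypothetical allegation-level rows: complaint ids repeat when a
# complaint carries more than one allegation.
rows = [
    {"UniqueComplaintId": 1, "Received Year": 2010, "Allegation FADO Type": "Force"},
    {"UniqueComplaintId": 1, "Received Year": 2010, "Allegation FADO Type": "Discourtesy"},
    {"UniqueComplaintId": 2, "Received Year": 2011, "Allegation FADO Type": "Abuse of Authority"},
]


def one_row_per_complaint(rows, id_key="UniqueComplaintId"):
    """Keep the first row seen for each complaint id, preserving input order."""
    seen = set()
    unique = []
    for row in rows:
        if row[id_key] not in seen:
            seen.add(row[id_key])
            unique.append(row)
    return unique


complaints = one_row_per_complaint(rows)
print(len(rows), "allegation rows ->", len(complaints), "complaint rows")
```

Note that the row kept for each complaint carries the allegation-level fields of its first allegation only, so the de-duplicated table suits complaint-level counts while allegation-level questions still need the full table.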
This graph shows the number of complaints received by the CCRB each year, broken down by filing mode.
The number of complaints has decreased since 2010. Most complaints are received by phone or through the call processing system. Use of the online website has also increased relative to filing in person.
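The counts behind a chart like this one could be tabulated roughly as below. It is a sketch in plain Python with made-up rows. The column names `Received Year` and `Complaint Filed Mode`, and the mode labels, are assumptions for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical complaint-level rows (one row per complaint).
complaints = [
    {"Received Year": 2010, "Complaint Filed Mode": "Phone"},
    {"Received Year": 2010, "Complaint Filed Mode": "Phone"},
    {"Received Year": 2010, "Complaint Filed Mode": "In-person"},
    {"Received Year": 2011, "Complaint Filed Mode": "Phone"},
    {"Received Year": 2011, "Complaint Filed Mode": "On-line website"},
]

# counts[year][mode] = number of complaints filed that year by that mode.
counts = defaultdict(Counter)
for row in complaints:
    counts[row["Received Year"]][row["Complaint Filed Mode"]] += 1

for year in sorted(counts):
    print(year, dict(counts[year]))
```

Each year's counter corresponds to one stacked bar, and its total is the bar height.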
This pie chart shows the reasons for initial contact. More than 1/3 of the complaints are related to PD suspected C/V of violation/crime on street, building and auto. Disputes and moving violations are also among the main reasons for initial contact.
A new variable, “Processing Time”, is calculated as the duration between when the complaint was received by the CCRB and when it was closed.
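A minimal sketch of this derived variable is shown below, assuming the data records only the year received and the year closed. The column names `Received Year` and `Close Year` and the sample values are assumptions for the example.

```python
from collections import Counter

# Hypothetical complaint-level rows.
complaints = [
    {"UniqueComplaintId": 1, "Received Year": 2010, "Close Year": 2010},
    {"UniqueComplaintId": 2, "Received Year": 2010, "Close Year": 2011},
    {"UniqueComplaintId": 3, "Received Year": 2005, "Close Year": 2014},
]

# Processing time in whole calendar years: close year minus received year.
for row in complaints:
    row["Processing Time"] = row["Close Year"] - row["Received Year"]

# Number of complaints at each processing time, as a histogram would show.
distribution = Counter(row["Processing Time"] for row in complaints)
for years in sorted(distribution):
    print(years, distribution[years])
```

With year-level data the measure is coarse: a value of 0 covers anything closed in the same calendar year, and a value of 1 can mean a few days or nearly two years.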
As we can see from the graph, most of the complaints were closed within the same year or one year after they were reported to the CCRB. However, a few cases took longer than 8 years to close.
What could be the factors affecting processing time? According to the dataset, not all complaints were fully investigated by the CCRB before they were closed. The processing time may be related to whether the complaint was fully investigated and whether it has video evidence.
The upper-left part of the graph shows the processing time of cases that were not fully investigated and have no video evidence, which is the most common scenario among the complaints. Cases in this scenario are more likely to be closed within the same year they were received.
If a case has video evidence (i.e., the bottom row of the graph), it usually takes longer to process than cases that have no video evidence.
However, there is no strong indication that complaints with video evidence are more likely to receive a full investigation. Even if a case was fully investigated, it could be closed within a year if the complaint doesn’t have video evidence.
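The four-panel comparison discussed above amounts to grouping processing time by two yes/no flags. A rough sketch in plain Python follows. The column names `Is Full Investigation` and `Complaint Has Video Evidence`, and all values, are assumptions chosen for illustration and are not results from the dataset.

```python
from collections import defaultdict
from statistics import median

# Hypothetical complaint-level rows with a precomputed processing time (years).
complaints = [
    {"Is Full Investigation": False, "Complaint Has Video Evidence": False, "Processing Time": 0},
    {"Is Full Investigation": False, "Complaint Has Video Evidence": False, "Processing Time": 0},
    {"Is Full Investigation": False, "Complaint Has Video Evidence": False, "Processing Time": 1},
    {"Is Full Investigation": True, "Complaint Has Video Evidence": False, "Processing Time": 1},
    {"Is Full Investigation": False, "Complaint Has Video Evidence": True, "Processing Time": 1},
    {"Is Full Investigation": False, "Complaint Has Video Evidence": True, "Processing Time": 2},
    {"Is Full Investigation": True, "Complaint Has Video Evidence": True, "Processing Time": 2},
]

# One group per (fully investigated, has video) combination, i.e. per panel.
groups = defaultdict(list)
for row in complaints:
    key = (row["Is Full Investigation"], row["Complaint Has Video Evidence"])
    groups[key].append(row["Processing Time"])

# Panel size and median processing time for each combination.
summary = {key: (len(times), median(times)) for key, times in groups.items()}
for key in sorted(summary):
    print(key, summary[key])
```

Comparing panel sizes alongside medians helps avoid reading too much into a panel that contains only a handful of cases.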
This graph shows the number of complaints received by the CCRB each year by the borough of the incident. Among all boroughs of NYC, Brooklyn had the largest number of incidents. Queens and Staten Island saw relatively fewer incidents than the Bronx and Manhattan.
This graph shows the number of incidents that occurred in each borough over the years. The number of incidents in Brooklyn in 2015 was down by 2/3 compared to 2005. The Bronx and Manhattan have also seen significant reductions in the number of complaints. The number of incidents in Staten Island hasn’t changed much in the last ten years.
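A statement such as "reduced by 2/3" is a relative change between two yearly counts. The arithmetic can be sketched as below. The counts are placeholder numbers picked to make the calculation easy to follow and are not figures from the CCRB data.

```python
def relative_change(old, new):
    """Return (new - old) / old, e.g. -0.5 for a 50% drop."""
    if old == 0:
        raise ValueError("relative change is undefined when the old count is 0")
    return (new - old) / old


# Placeholder yearly counts per borough (hypothetical values).
counts = {
    "Brooklyn": {2005: 3000, 2015: 1000},
    "Staten Island": {2005: 300, 2015: 290},
}

changes = {
    borough: relative_change(by_year[2005], by_year[2015])
    for borough, by_year in counts.items()
}
for borough, change in changes.items():
    print(f"{borough}: {change:+.0%}")
```

Reporting the change relative to each borough's own starting count keeps boroughs of very different sizes comparable.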
This pie chart shows the locations where the incidents happened. Most incidents happened on streets and highways. Another major incident location is apartments and residential houses.
This bar chart shows the number of allegations by FADO type and the outcome of the encounter. Most allegations are related to abuse of authority. If the allegation was related to Force, the encounter was very likely to end in an arrest. In cases of abuse of authority and discourtesy, the outcome is usually a summons or no arrest.
Most of the Abuse of Authority allegations happened in Brooklyn. For the Discourtesy, Force and Offensive Language types, however, there is no distinctive pattern as to which FADO type is more likely to occur in which borough.
Overall, all types of allegations have declined since 2006, with Abuse of Authority allegations seeing the most significant change over the years.
This data analysis focuses on exploring the patterns and relationships of discrete, nominal and binary data in the NYC Data Transparency Initiative dataset.
Bar charts and pie charts are good tools for presenting categorical data. Stacked bar charts are helpful for exploring potential relationships across three variables. Categorical data can also be manipulated using R functions and converted into numeric data, such as counts, for quantitative analysis.