Data Evaluation: NYPD Complaint Data: 2006-2016

The purpose of this document is to provide a preliminary evaluation of the quality of the NYPD Historical Complaint Data for use in general analysis or as a data component to be used in a wider analysis that asks:

Is crime on the rise over the last 10 years in New York City?

The basic procedure for evaluation involves a brief assessment of the data contents and the data quality. Confirming the data contents tells the analyst whether the data contains useful information for their query, and checking the data quality helps the analyst estimate the work required to extract that information:

Data Contents:
- Scope of available information
- Data formats
- Size

Data Quality:
- Density of records (missing data)
- Consistency of records (computer readable)

Data Contents:

Scope:

The New York City (NYC) OpenData program provides access to the information that is produced and used by City government to encourage civic transparency and engagement. In support of this goal, the New York City Police Department (NYPD) shares its historical crime records with the public, which “…includes all valid felony, misdemeanor, and violation crimes reported to the New York City Police Department (NYPD) from 2006 to the end of last year (2016).”

Formats:

The data has the following stated contents:

- 5.58M records within a .csv file;
- 24 columns of information including: date the crime was reported, type of crime, borough of crime, description of the crime, etc.; and
- documentation with the data’s footnotes, exclusions, and specifics as well as a detailed description of all 24 columns of information with corresponding datatypes.

Size:

Getting a grasp of the data size can help an analyst select the software or method for extracting the information. With more than 5 million records, this data set exceeds Excel’s worksheet limit of 1,048,576 rows, so Excel cannot handle it without first breaking up the primary data into smaller parts. One option for a data set of this size is R (a programming language that can be thought of as a powerful spreadsheet).

The R code used to read and query the .csv file is available by clicking the “code” buttons to the right of this text, when viewing this document online.
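The author's own reading code is not reproduced in this text. As a rough sketch of the general approach, assuming base R only, the snippet below writes a tiny stand-in file (the file and its three rows are invented for illustration) and reads it back with the column types declared up front, which tends to help with speed and memory on multi-million-row files.

```r
# Stand-in for the real download: a tiny CSV written to a temporary file.
# The rows below are invented for illustration only.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Year,Borough_Name,Crime_Category",
             "2015,BRONX,FELONY",
             "2016,QUEENS,MISDEMEANOR",
             "2016,MANHATTAN,VIOLATION"),
           csv_path)

# Declaring column types up front avoids type guessing on every column.
nypd_sample <- read.csv(csv_path,
                        colClasses = c("integer", "character", "character"),
                        stringsAsFactors = FALSE)

dim(nypd_sample)   # number of rows and columns actually read
str(nypd_sample)   # column names and types
```

For the full 5.58M-record file, faster readers such as data.table::fread() or readr::read_csv() are common choices; which one the original analysis used is not stated here.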

Data Quality:

Public-facing data portals can vary dramatically with respect to the quality of the entered records and the completeness of entries within the data. Below is the quality assessment of this data set.

Density of Records:

The number of records that contain missing information is small in comparison to the total data available: approximately 40,000 records (less than 1% of the data) have some piece of information missing. Moreover, the categories most likely to be sampled are more complete than the rest.

Therefore, the number of lost data points is unlikely to impact the results of the primary investigation query.
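One way such a density check could be run is sketched below. The data frame `toy` is a made-up stand-in for `nypd_data`, so the counts it produces say nothing about the real file; the sketch also assumes missing entries are stored as `NA` (blank strings would first need converting, for example via the `na.strings` argument of `read.csv()`).

```r
# Toy data frame standing in for nypd_data; all values are invented.
toy <- data.frame(
  Year         = c(2015, 2015, 2016, 2016),
  Borough_Name = c("BRONX", NA, "QUEENS", "MANHATTAN"),
  Latitude     = c(40.82, 40.80, NA, 40.73),
  stringsAsFactors = FALSE
)

# Number of missing values in each column
missing_by_col <- colSums(is.na(toy))

# Records with at least one missing field, and their share of all records
incomplete_n     <- sum(!complete.cases(toy))
incomplete_share <- incomplete_n / nrow(toy)

missing_by_col
incomplete_n
incomplete_share
```

Looking at the per-column counts alongside the overall share shows whether the gaps fall in columns the analysis actually needs.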

In addition, I have confirmed the data covers the stated time frame of 2006 through 2016:

unique(nypd_data$Year)

##  [1] 2015 2014 2013 2012 2011 2010 2009 2008 2007 2006 2016

Although the years are not in order, the above example query confirms that every year from 2006 through 2016 is present.
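The visual check above can also be made explicit. The sketch below copies the years from the query output into a vector (the names `years_found`, `expected_years`, and `missing_years` are illustrative) and compares them against the expected range.

```r
# Years as returned by the query above, in their original order
years_found <- c(2015, 2014, 2013, 2012, 2011, 2010, 2009, 2008, 2007, 2006, 2016)

# Every calendar year the data set is stated to cover
expected_years <- 2006:2016

# Expected years that do not appear in the data
missing_years <- setdiff(expected_years, years_found)

sort(years_found)       # same values, in ascending order
length(missing_years)   # 0 means every expected year is present
```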

Consistency of Records:

To ready the data for analysis, it’s important to make sure the entries within the records are consistent in format. For example, the difference between ‘2006’ and ‘06’ in the ‘Year’ column can skew query results and create confusion. The table below shows a snapshot of some of the information available in this data set.

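library(kableExtra)  # provides kable_styling() and re-exports the %>% pipe

# Render the first six records as a left-aligned, striped HTML table
knitr::kable(head(nypd_data), "html",
             caption = "First 6 Records, NYPD Crime Data, selected columns") %>%
  kable_styling(position = "left", full_width = FALSE,
                bootstrap_options = c("striped", "hover"))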

First 6 Records, NYPD Crime Data, selected columns
Year	Year_Month	Report_Dt	Crime_Category	Borough_Name	Premises_Type	Attmptd_Cmpltd	Latitude	Longitude
2015	2015-12	2015-12-31	FELONY	BRONX	BAR/NIGHT CLUB	COMPLETED	40.82885	-73.91666
2015	2015-12	2015-12-31	FELONY	MANHATTAN	OTHER	COMPLETED	40.80261	-73.94505
2015	2015-12	2015-12-31	MISDEMEANOR	QUEENS	RESIDENCE-HOUSE	COMPLETED	40.65455	-73.72634
2015	2015-12	2015-12-31	MISDEMEANOR	MANHATTAN	OTHER	COMPLETED	40.73800	-73.98789
2015	2015-12	2015-12-31	FELONY	BROOKLYN	DRUG STORE	ATTEMPTED	40.66502	-73.95711
2015	2015-12	2015-12-31	MISDEMEANOR	MANHATTAN	STREET	COMPLETED	40.72020	-73.98874

This provides a good preview of the confirmed consistency of the entries, e.g., ‘MANHATTAN’ appears in all caps, without abbreviations or alternative formats, throughout the data.
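A six-row preview cannot show every spelling in 5.58M records, so a fuller check would tabulate the distinct values in each column. The sketch below illustrates the idea on an invented data frame that deliberately contains the two kinds of inconsistency mentioned above; the `expected_boroughs` spellings are an assumption about how the file writes borough names, not something taken from its documentation.

```r
# Toy data with deliberate inconsistencies; all values are invented.
toy <- data.frame(
  Year         = c("2015", "2016", "06"),
  Borough_Name = c("MANHATTAN", "Manhattan", "BRONX"),
  stringsAsFactors = FALSE
)

# Assumed canonical spellings of the five boroughs
expected_boroughs <- c("BRONX", "BROOKLYN", "MANHATTAN", "QUEENS", "STATEN ISLAND")

# Every distinct spelling with its frequency, including missing values
table(toy$Borough_Name, useNA = "ifany")

# Flag borough names outside the expected set and years that are not four digits
bad_borough <- !(toy$Borough_Name %in% expected_boroughs)
bad_year    <- !grepl("^[0-9]{4}$", toy$Year)

# Rows needing cleanup before analysis
toy[bad_borough | bad_year, ]
```

On the real data, an empty result from the last line would support the consistency claim for these two columns.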

Conclusion:

The NYPD historical crime data contains many dimensions of information that would be valuable in determining whether crime is on the rise in the five boroughs over the past 10 years. However, it is unlikely to be sufficient as the sole source of information.

Some considerations:

- The raw counts provide a base for approaching the question, but the data set lacks any corresponding annual population values that could, for example, allow for the calculation of incidents of crime per 100K New Yorkers.
- The size of the data is unwieldy, so it may be wise to reduce the time frame to five years and perhaps to query a single borough at a time. The OpenData portal provides an option for limiting the scope of the original file.
- Intra-year variances could be due to changes in collection procedures within the reporting years and may need to be investigated further.
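To illustrate the per-100K calculation mentioned in the first consideration, the sketch below joins annual counts to annual population values and computes a rate. Every number in it is a placeholder chosen for easy arithmetic, not a real count or population figure, and the `population` table is a hypothetical stand-in for an external source that would have to be obtained separately.

```r
# PLACEHOLDER annual complaint counts (not real figures)
counts <- data.frame(Year = c(2006, 2007),
                     Reported_Crimes = c(500000, 480000))

# PLACEHOLDER annual population values (not real figures)
population <- data.frame(Year = c(2006, 2007),
                         Population = c(8000000, 8000000))

# Join on Year, then scale counts to incidents per 100,000 residents
rates <- merge(counts, population, by = "Year")
rates$Rate_per_100k <- rates$Reported_Crimes / rates$Population * 1e5

rates
```

Comparing rates rather than raw counts keeps population growth from being mistaken for a rise in crime.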

Sample Visualizations:

Below are some general visualizations of the data following some light manipulation and formatting:

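library(dplyr)    # group_by(), summarise(), %>%
library(ggplot2)  # plotting

# Annual complaint totals for each borough
b_data <- nypd_data %>%
  group_by(Borough_Name, Year) %>%
  summarise(Reported_Crimes = n(), .groups = "drop")

# Grouped bar chart: one bar per borough within each year.
# my_yr_plot_breaks and my_theme are helper objects not shown in this text.
ggplot(b_data, aes(x = Year, y = Reported_Crimes, fill = Borough_Name)) +
  geom_bar(stat = "identity", position = "dodge", alpha = .9) +
  my_yr_plot_breaks +
  scale_y_continuous(labels = scales::comma,
                     breaks = pretty(b_data$Reported_Crimes, n = 8)) +
  my_theme +
  labs(title = "NYPD Complaints:",
       subtitle = "Raw Counts 2006-16")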

Below is a mapping of NYPD’s 2016 complaints with status “completed” in Manhattan (i.e. it excludes attempted crimes which failed or were interrupted prematurely). The Longitude and Latitude GPS coordinates within the data allow one to place markers on the map that coincide with the rough locations where the complaints occurred.

If viewing online, zooming in on the map will cause the clustered points to break down further, revealing finer detail. Hovering over a point shows a label with its crime category (Violation, Misdemeanor, or Felony), and clicking on it opens a popup with the premises type.

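library(leaflet)

# 2016 completed complaints in Manhattan
m_data <- nypd_data %>%
  filter(Year == 2016,
         Attmptd_Cmpltd == "COMPLETED",
         Borough_Name == "MANHATTAN")

# Clustered markers: hover label = crime category, click popup = premises type
leaflet(data = m_data) %>% addTiles() %>%
  addMarkers(~Longitude, ~Latitude,
             popup = ~Premises_Type,
             label = ~Crime_Category,
             clusterOptions = markerClusterOptions(),
             # the text styling applies to the hover label, so it belongs in labelOptions
             labelOptions = labelOptions(style = list("color" = "red",
                                                      "font-size" = "12px")))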

# Annual complaint totals by crime category, Manhattan only
manhat_only <- nypd_data %>%
  filter(grepl('MANHATTAN', Borough_Name)) %>%
  group_by(Crime_Category, Year) %>%
  summarise(Reported_Crimes = n(), .groups = "drop")

# Grouped bar chart: one bar per crime category within each year.
# The y-axis breaks are computed from manhat_only, the data being plotted.
ggplot(manhat_only, aes(x = Year, y = Reported_Crimes, fill = Crime_Category)) +
  geom_bar(stat = "identity", position = "dodge", alpha = .9) +
  my_yr_plot_breaks +
  scale_y_continuous(labels = scales::comma,
                     breaks = pretty(manhat_only$Reported_Crimes, n = 8)) +
  my_theme +
  labs(title = "NYPD Complaints by Type: Manhattan",
       subtitle = "Raw Counts 2006-16")