The purpose of this document is to provide a preliminary evaluation of the quality of the NYPD Historical Complaint Data, either for general analysis or as one data component in a wider analysis that asks:

Is crime on the rise over the last 10 years in New York City?

The basic procedure for evaluation involves a brief assessment of the data contents and the data quality. Confirming the data contents tells the analyst whether the data contains useful information for their query, and checking the data quality helps the analyst evaluate the work required to extract that information:

Data Contents:

Scope:

The New York City (NYC) OpenData program provides access to the information that is produced and used by City government to encourage civic transparency and engagement. In support of this goal, the New York City Police Department (NYPD) shares its historical crime records with the public, which “…includes all valid felony, misdemeanor, and violation crimes reported to the New York City Police Department (NYPD) from 2006 to the end of last year (2016).”

Formats:

The data has the following stated contents:

  • 5.58M records within a .csv file;
  • 24 columns of information including: date the crime was reported, type of crime, borough of crime, description of the crime, etc.; and
  • documentation with the data’s footnotes, exclusions, and specifics as well as a detailed description of all 24 columns of information with corresponding datatypes.

Size:

Getting a grasp of the data size can help an analyst select the software or method for extracting the information. As this data set contains more than 5 million records, it exceeds Excel's limit of 1,048,576 rows per worksheet and cannot be opened there without first breaking the primary data into smaller parts. One option for a data set of this size is R (a programming language that can be thought of as a powerful spreadsheet).
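
As a rough illustration of the size argument, the sketch below compares the stated record count with Excel's per-worksheet row limit and then reads a .csv file with base R. The small temporary file and its rows are made-up stand-ins for illustration only, not the NYPD file itself.

```r
# Excel holds at most 1,048,576 rows per worksheet
excel_row_limit <- 1048576
stated_records  <- 5.58e6      # record count stated for the data set

# number of worksheets needed if the data were split up for Excel
sheets_needed <- ceiling(stated_records / excel_row_limit)

# Hypothetical stand-in file: three made-up rows written to a temporary .csv
tmp_file <- tempfile(fileext = ".csv")
writeLines(c("Year,Borough_Name",
             "2015,BRONX",
             "2015,QUEENS",
             "2016,MANHATTAN"), tmp_file)

# R reads the whole file into a single data frame; the practical limit is
# available memory rather than a fixed row count
toy_data <- read.csv(tmp_file, stringsAsFactors = FALSE)
nrow(toy_data)
```

For a file of the real size, faster readers such as data.table::fread() or readr::read_csv() are common alternatives to read.csv().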

The R code used to read and query the .csv file is available by clicking the “code” buttons to the right of this text, when viewing this document online.

Quality:

Public-facing data portals can vary dramatically with respect to the quality of the entered records and the completeness of entries within the data. Below is the quality assessment of this data set.

Density of Records:

The number of records that contain missing information is small in comparison to the total data available: approximately 40,000 records (less than 1% of the data) have some piece of information missing. Moreover, the categories most likely to be sampled are more complete than the data set as a whole.

Therefore, the number of lost data points is unlikely to impact the results of the primary investigation query.
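
A check of this kind can be sketched in base R as follows. The toy data frame is hypothetical and only borrows column names from this data set; it is not the query that produced the figures above.

```r
# Hypothetical toy records; NA marks a missing entry
toy_data <- data.frame(
  Year          = c(2015, 2015, 2016, 2016),
  Borough_Name  = c("BRONX", NA, "QUEENS", "MANHATTAN"),
  Premises_Type = c("STREET", "OTHER", NA, "STREET"),
  stringsAsFactors = FALSE
)

# number of missing entries in each column
missing_by_column <- colSums(is.na(toy_data))

# records with at least one missing entry, as a count and as a share
incomplete_count <- sum(!complete.cases(toy_data))
incomplete_share <- mean(!complete.cases(toy_data))

missing_by_column
incomplete_count
incomplete_share
```

The per-column counts show which categories carry the gaps, which matters more for a given query than the overall share.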

In addition, I have confirmed that the data covers the stated time frame, 2006 through 2016:

unique(nypd_data$Year)     
##  [1] 2015 2014 2013 2012 2011 2010 2009 2008 2007 2006 2016

Although the years are not in order, the above example query confirms that all 11 calendar years from 2006 through 2016 are present.
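
The same check can be made explicit rather than read by eye. The minimal sketch below copies the years from the query output above into a vector and compares them with the expected range; the variable names are illustrative.

```r
# years as returned by the query above
years_found    <- c(2015, 2014, 2013, 2012, 2011, 2010,
                    2009, 2008, 2007, 2006, 2016)
years_expected <- 2006:2016

# same values in ascending order, easier to scan
sort(years_found)

# expected years that do not appear in the data
missing_years <- setdiff(years_expected, years_found)
length(missing_years) == 0   # TRUE when no year is absent
```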

Consistency of Records:

To ready the data for analysis, it’s important to make sure the entries within the records are consistent in format. For example, the difference between ‘2006’ and ‘06’ in the ‘Year’ column can skew query results and create confusion. The table below shows a snapshot of some of the information available in this data set.

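library(kableExtra)   # provides kable_styling() and re-exports the %>% pipe

# render the first six records as a styled HTML table
knitr::kable(head(nypd_data), format = "html",
             caption = "First 6 Records, NYPD Crime Data, selected columns") %>%
  kable_styling(position = "left", full_width = FALSE,
                bootstrap_options = c("striped", "hover"))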

First 6 Records, NYPD Crime Data, selected columns

| Year | Year_Month | Report_Dt | Crime_Category | Borough_Name | Premises_Type | Attmptd_Cmpltd | Latitude | Longitude |
|------|------------|------------|----------------|--------------|-----------------|----------------|----------|-----------|
| 2015 | 2015-12 | 2015-12-31 | FELONY | BRONX | BAR/NIGHT CLUB | COMPLETED | 40.82885 | -73.91666 |
| 2015 | 2015-12 | 2015-12-31 | FELONY | MANHATTAN | OTHER | COMPLETED | 40.80261 | -73.94505 |
| 2015 | 2015-12 | 2015-12-31 | MISDEMEANOR | QUEENS | RESIDENCE-HOUSE | COMPLETED | 40.65455 | -73.72634 |
| 2015 | 2015-12 | 2015-12-31 | MISDEMEANOR | MANHATTAN | OTHER | COMPLETED | 40.73800 | -73.98789 |
| 2015 | 2015-12 | 2015-12-31 | FELONY | BROOKLYN | DRUG STORE | ATTEMPTED | 40.66502 | -73.95711 |
| 2015 | 2015-12 | 2015-12-31 | MISDEMEANOR | MANHATTAN | STREET | COMPLETED | 40.72020 | -73.98874 |

This provides a good preview of the consistency of the entries. For example, ‘MANHATTAN’ appears in all caps throughout the data, with no abbreviations or alternative formats.
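
One way to test consistency beyond the first few rows is to flag entries that break the expected format. The sketch below is a minimal base R example on hypothetical toy values that deliberately include two inconsistent entries; it is not the check that was run on the NYPD data.

```r
# Hypothetical toy values, including two deliberately inconsistent entries
toy_data <- data.frame(
  Year         = c("2006", "06", "2015", "2016"),
  Borough_Name = c("MANHATTAN", "Manhattan", "BRONX", "QUEENS"),
  stringsAsFactors = FALSE
)

# a consistent Year is exactly four digits
bad_year <- !grepl("^[0-9]{4}$", toy_data$Year)

# a consistent Borough_Name equals its own upper-case form
bad_borough <- toy_data$Borough_Name != toupper(toy_data$Borough_Name)

toy_data[bad_year | bad_borough, ]   # rows that would need cleanup
table(toy_data$Borough_Name)         # one entry per distinct spelling
```

A frequency table of each categorical column is often the quickest check, since stray spellings show up as extra entries with small counts.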

Conclusion:

The NYPD historical crime data contains many dimensions of information that would be valuable in determining whether crime is on the rise in the five boroughs over the past 10 years. However, it is unlikely to be the sole source of information.

Some considerations:

Sample Visualizations:

Below are some general visualizations of the data following some light manipulation and formatting:

b_data <- nypd_data %>%                         # create annual totals
  group_by(Borough_Name, Year) %>% 
  summarise(Reported_Crimes = n())
  
ggplot(b_data, aes(x = Year, y = Reported_Crimes,  fill = Borough_Name)) + 
  geom_bar(stat = "identity", position = "dodge", alpha = .9) +
  my_yr_plot_breaks + 
  scale_y_continuous(labels = scales::comma,
                     breaks = pretty(b_data$Reported_Crimes, n = 8)) + 
  my_theme +
  labs(title = "NYPD Complaints:",
       subtitle = "Raw Counts 2006-16")

Below is a mapping of NYPD’s 2016 complaints with status “completed” in Manhattan (i.e. it excludes attempted crimes which failed or were interrupted prematurely). The Longitude and Latitude GPS coordinates within the data allow one to place markers on the map that coincide with the rough locations where the complaints occurred.

If viewing online, zooming in on the map will cause the clustered points to break down further, revealing finer detail. Hovering over a point shows a label with the crime category (Violation, Misdemeanor, or Felony), and clicking on a point opens a popup with the premises type.

m_data <- nypd_data %>%
  filter(grepl('2016', Year),
         grepl('COMPLETED', Attmptd_Cmpltd),
         grepl('MANHATTAN', Borough_Name))

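leaflet(data = m_data) %>% addTiles() %>%
  addMarkers(~Longitude, ~Latitude,
             popup = ~Premises_Type,        # shown on click
             label = ~Crime_Category,       # shown on hover
             clusterOptions = markerClusterOptions(),
             # label styling goes in labelOptions(), which accepts `style`
             labelOptions = labelOptions(style = list("color" = "red",
                                                      "font-size" = "12px")))

# annual totals by crime category, Manhattan only
manhat_only <- nypd_data %>%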
  filter(grepl('MANHATTAN', Borough_Name)) %>% 
  group_by(Crime_Category, Year) %>% 
  summarise(Reported_Crimes = n())
  
ggplot(manhat_only, aes(x = Year, y = Reported_Crimes,  fill = Crime_Category)) + 
  geom_bar(stat = "identity", position = "dodge", alpha = .9) +
  my_yr_plot_breaks +
  scale_y_continuous(labels = scales::comma,
                     breaks = pretty(manhat_only$Reported_Crimes, n = 8)) + 
  my_theme +
  labs(title = "NYPD Complaints by Type: Manhattan",
       subtitle = "Raw Counts 2006-16")