New York City Crimes Analysis

Introduction

Problem Statement

The Trump administration has made a priority of reducing violent crime. As the nation’s largest city, New York has been described as a violent city in the press, on TV and in the movies. In this project, we analyze the crimes reported in all five boroughs of New York City from 2014 to 2015, and try to figure out where and when crimes happen and which factors influence a borough’s crime rate.

Data Source and Methodology

This project explores the New York City Crimes dataset, which contains NYPD Complaint Data from 2014 to 2015. The dataset records the occurrence time, location, status, type and level of each offense. We also collect demographic information, such as population, unemployment rate, poverty rate, median age and median household income, for New York City as a whole and for its five boroughs (Bronx, Brooklyn, Manhattan, Queens, Staten Island).

We prepare and clean the dataset, then conduct exploratory data analysis and visualize the data. Data visualization shows us when most crimes happen and what kinds of crimes they are. We also relate crime to the demographic factors of the five boroughs of New York, and use interactive maps to show how crime varies between blocks.

Mission

This project is intended to help the public and organizations gain a better understanding of the crime situation in New York City. We hope this analysis helps residents decide when and where it’s safe to go out alone. Organizations can use the data to see which factors can be improved to make the community safer.

Packages Required

The following packages are used throughout this project.

library(tidyr)      # used for tidying up data
library(dplyr)      # used for data manipulation
library(lubridate)  # used for transforming date
library(knitr)      # used for viewing data
library(leaflet)    # used for creating interactive maps
library(ggplot2)    # used for data visualization
library(gridExtra)  # used for arranging grid-based plots

Data Preparation

We perform the following procedures to get the data ready for analysis.

Data Description

The New York City Crimes data we use can be found on Kaggle in the form of csv files. The data is collected from NYC Open Data and contains crimes reported to the NYPD from 2014 to 2015. The following table lists the original column names, the easier-to-understand names we rename them to, and a description of each column.

OldColumnName	NewColumnName	ColumnDescription
CMPLNT_NUM	crime_id	Randomly generated persistent ID for each complaint
CMPLNT_FR_DT	occurance_date	Exact date of occurrence for the reported event (or starting date of occurrence, if CMPLNT_TO_DT exists)
CMPLNT_FR_TM	occurance_time	Exact time of occurrence for the reported event (or starting time of occurrence, if CMPLNT_TO_TM exists)
CMPLNT_TO_DT	ending_date	Ending date of occurrence for the reported event, if exact time of occurrence is unknown
CMPLNT_TO_TM	ending_time	Ending time of occurrence for the reported event, if exact time of occurrence is unknown
RPT_DT	reported_date	Date event was reported to police
KY_CD	offense_classification_code	Three digit offense classification code
OFNS_DESC	offense_classification_description	Description of offense corresponding with key code
PD_CD	internal_classification_code	Three digit internal classification code (more granular than Key Code)
PD_DESC	internal_classification_description	Description of internal classification corresponding with PD code (more granular than Offense Description)
CRM_ATPT_CPTD_CD	crime_status	Indicator of whether crime was successfully completed or attempted, but failed or was interrupted prematurely
LAW_CAT_CD	level_of_offense	Level of offense: felony, misdemeanor, violation
JURIS_DESC	type_of_jurisdiction	Jurisdiction responsible for incident. Either internal, like Police, Transit, and Housing; or external, like Correction, Port Authority, etc.
BORO_NM	borough	The name of the borough in which the incident occurred
ADDR_PCT_CD	precienct	The precinct in which the incident occurred
LOC_OF_OCCUR_DESC	specific_location	Specific location of occurrence in or around the premises; inside, opposite of, front of, rear of
PREM_TYP_DESC	type_of_location	Specific description of premises; grocery store, residence, street, etc.
PARKS_NM	park_name	Name of NYC park, playground or greenspace of occurrence, if applicable (state parks are not included)
HADEVELOPT	housing_name	Name of NYCHA housing development of occurrence, if applicable
X_COORD_CD	x_coordinate	X-coordinate for New York State Plane Coordinate System, Long Island Zone, NAD 83, units feet (FIPS 3104)
Y_COORD_CD	y_coordinate	Y-coordinate for New York State Plane Coordinate System, Long Island Zone, NAD 83, units feet (FIPS 3104)
Latitude	latitude	Latitude coordinate for Global Coordinate System, WGS 1984, decimal degrees (EPSG 4326)
Longitude	longitude	Longitude coordinate for Global Coordinate System, WGS 1984, decimal degrees (EPSG 4326)
Lat_Lon	location	Latitude and longitude of the incident combined in a single field

We also collect demographic information for the five New York City boroughs from the following websites and put it in a new table:

YCharts: unemployment rate
Wikipedia: population, land area
DataUSA: poverty rate, median age, median household income, median property value

Import and Clean Data

We run the following code to import the dataset downloaded from Kaggle into R. All columns are renamed to easier-to-understand names. After inspecting the source data, we remove the “crime_id”, “offense_classification_code”, “internal_classification_code”, “x_coordinate”, “y_coordinate” and “location” columns because they are redundant or unwanted.

nyc <- read.csv("NYPD_Complaint_Data_Historic.csv")
colnames(nyc) <- c("crime_id","occurance_date","occurance_time","ending_date","ending_time","reported_date","offense_classification_code","offense_classification_description","internal_classification_code","internal_classification_description","crime_status","level_of_offense","type_of_jurisdiction","borough","precienct","specific_location","type_of_location","park_name","housing_name","x_coordinate","y_coordinate","latitude","longitude","location")

nyc <- nyc[,-c(1,7,9,20,21,24)]
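
Dropping columns by position breaks silently if the column order of the csv file ever changes. A minimal sketch of the same step using column names instead, on a toy data frame (`toy`, `drop_cols` and `toy_clean` are illustrative names, not part of the original analysis):

```r
# Toy stand-in for the renamed data; only a few of the 24 columns are shown.
toy <- data.frame(
  crime_id = c(101, 102),
  occurance_date = c("12/31/2015", "12/30/2015"),
  offense_classification_code = c(113, 344),
  borough = c("BRONX", "QUEENS"),
  x_coordinate = c(1007314, 1043991),
  location = c("(40.82, -73.91)", "(40.69, -73.78)")
)

# Columns the report treats as redundant or unwanted.
drop_cols <- c("crime_id", "offense_classification_code",
               "internal_classification_code",
               "x_coordinate", "y_coordinate", "location")

# Keep every column whose name is not in drop_cols.
toy_clean <- toy[, !(names(toy) %in% drop_cols), drop = FALSE]
print(names(toy_clean))
```

Names listed in `drop_cols` that are absent from the data are simply ignored, so the same vector can be reused on the full dataset.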

We then extract the components of “occurance_date” as “occurance_year”, “occurance_month”, “occurance_day” and “occurance_weekdays”, and create “occurance_hour” from “occurance_time”. The duration of a crime is calculated as “diff_ending.occurance”, and “diff_reported.occurance” is the time between the occurrence of a crime and its being reported to the NYPD. Saturdays and Sundays are flagged in “weekends”. The dataset contains a few records from before 2014, which we remove.

nyc <- nyc %>%
  mutate(occurance_year = year(mdy(occurance_date)),
         occurance_month = month(mdy(occurance_date)),
         occurance_day = day(mdy(occurance_date)),
         occurance_weekdays = weekdays(mdy(occurance_date)),
         occurance_hour = hour(hms(occurance_time)),
         diff_reported.occurance = difftime(mdy(reported_date),mdy(occurance_date),units = "day"),
         occurance_date_time = as.POSIXct(paste(occurance_date,occurance_time),format = "%m/%d/%Y %H:%M:%S"),
         ending_date_time = as.POSIXct(paste(ending_date,ending_time),format = "%m/%d/%Y %H:%M:%S"),
         diff_ending.occurance = round(difftime(ending_date_time,occurance_date_time,units = "hours"),digits = 2),
         weekends = ifelse(occurance_weekdays %in% c("Sunday","Saturday"),"Yes","No")
         ) %>% 
  filter(occurance_year %in% c(2014, 2015))  # keep only records from 2014 and 2015
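
To check the derived columns against the data preview, the same transformations can be applied to a single record by hand. This sketch assumes the `lubridate` package loaded earlier and uses the start and end values of one previewed record (12/31/2015, 23:25:00 to 23:30:00). The `tz = "UTC"` argument is an assumption added here so the result does not depend on the local time zone; the variable names are illustrative.

```r
library(lubridate)

occ_date <- "12/31/2015"
occ_time <- "23:25:00"
end_time <- "23:30:00"

occ_year  <- year(mdy(occ_date))    # calendar year of the occurrence
occ_month <- month(mdy(occ_date))   # month number, 1-12
occ_hour  <- hour(hms(occ_time))    # hour of day, 0-23

# Combine date and time into full timestamps, then take the difference.
start <- as.POSIXct(paste(occ_date, occ_time),
                    format = "%m/%d/%Y %H:%M:%S", tz = "UTC")
end   <- as.POSIXct(paste(occ_date, end_time),
                    format = "%m/%d/%Y %H:%M:%S", tz = "UTC")
duration_hours <- round(as.numeric(difftime(end, start, units = "hours")), 2)

print(c(occ_year, occ_month, occ_hour, duration_hours))
```

A five-minute incident comes out as 0.08 hours, which matches the “diff_ending.occurance” value shown for that record in the preview.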

We calculate the total number of crimes reported for each borough of New York City as “total_crime” in a new table.

table2 <- nyc %>% 
  group_by(borough) %>% 
  summarise(total_crime = length(occurance_date))
table2$total_crime <- as.numeric(table2$total_crime)

We then add the following information for each of the 5 boroughs:

Land area (in square kilometers)
Population of the borough and its share of the New York City total
White, Black, Asian, and other or mixed populations of the borough and their shares of the borough’s total population
Unemployment rate
Poverty rate
Median age
Median household income (in US dollars)
Median property value (in US dollars)

total_poplutation <- rep(8175133,5)
population <- c(1385108,2504700,1585873,2230722,468730)
population_percent <- round(population/total_poplutation,digits = 4)
land_area <- c(110,180,59.1,280,152)
white <- c(386497,1072041,911073,886053,341677)
black <- c(505200,860083,246687,426683,49857)
asian <- c(50897,263519,180425,513317,35377)
other_or_mixed <- c(442514,309057,247688,404669,41819)
white_percent <- white/population
black_percent <- black/population
asian_percent <- asian/population
other_or_mixed_percent <- other_or_mixed/population
unemployment_rate <- c(0.961,0.825,0.659,0.683,0.129)
poverty_rate <- c(0.304,0.223,0.176,0.138,0.144)
median_age <- c(33.6,34.7,36.8,38.1,39.8)
median_household_income <- c(35176,51141,75575,60422,71622)
median_property_value <- c(368500,638500,867600,487400,457700)

table3 <- cbind(table2, land_area, population, population_percent, white, white_percent, black, black_percent, asian, asian_percent, other_or_mixed, other_or_mixed_percent, unemployment_rate, poverty_rate, median_age, median_household_income, median_property_value)

newrow1 <- data.frame(borough = "Whole New York",
                    total_crime = sum(table3$total_crime),
                    land_area = sum(table3$land_area),
                    population = sum(table3$population),
                    population_percent = sum(population_percent),
                    white = sum(table3$white),
                    white_percent = sum(table3$white)/total_poplutation[1],
                    black = sum(table3$black),
                    black_percent = sum(table3$black)/total_poplutation[1],
                    asian = sum(table3$asian),
                    asian_percent = sum(table3$asian)/total_poplutation[1],
                    other_or_mixed = sum(table3$other_or_mixed),
                    other_or_mixed_percent = sum(table3$other_or_mixed)/total_poplutation[1],
                    unemployment_rate = 0.658,
                    poverty_rate = 0.20,
                    median_age = 36,
                    median_household_income = 55752,median_property_value = 538300)  

#borough crime analysis table
bca <- rbind(table3,newrow1)
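
`cbind` lines the demographic vectors up with `table2` purely by position, which works only while the boroughs come out of `group_by` in the same order as the hand-typed vectors. A hedged alternative is to attach the borough name to the demographic values and join on it. The sketch below uses base R `merge` with the population and racial-group counts listed above (`demo`, `crime` and `joined` are illustrative names), and also checks that the four groups add up to each borough's population.

```r
boroughs <- c("BRONX", "BROOKLYN", "MANHATTAN", "QUEENS", "STATEN ISLAND")

demo <- data.frame(
  borough = boroughs,
  population = c(1385108, 2504700, 1585873, 2230722, 468730),
  white = c(386497, 1072041, 911073, 886053, 341677),
  black = c(505200, 860083, 246687, 426683, 49857),
  asian = c(50897, 263519, 180425, 513317, 35377),
  other_or_mixed = c(442514, 309057, 247688, 404669, 41819)
)

# Crime totals deliberately listed in a different order, to show that the
# join matches on the borough name and not on row position.
crime <- data.frame(
  borough = c("QUEENS", "BRONX", "STATEN ISLAND", "MANHATTAN", "BROOKLYN"),
  total_crime = c(193068, 208966, 44425, 223676, 288804)
)

joined <- merge(crime, demo, by = "borough")

# Sanity check: the racial-group counts should sum to the population.
group_sum <- with(joined, white + black + asian + other_or_mixed)
stopifnot(all(group_sum == joined$population))
print(joined[, c("borough", "total_crime", "population")])
```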

Final Data Preview

The following table is a snapshot of the cleaned dataset that contains all crime records.

(first 10 rows shown)
occurance_date	occurance_time	ending_date	ending_time	reported_date	offense_classification_description	internal_classification_description	crime_status	level_of_offense	type_of_jurisdiction	borough	precienct	specific_location	type_of_location	latitude	longitude	occurance_year	occurance_month	occurance_day	occurance_weekdays	occurance_hour	occurance_date_time	ending_date_time	diff_ending.occurance	weekends
12/31/2015	23:45:00			12/31/2015	FORGERY	FORGERY,ETC.,UNCLASSIFIED-FELO	COMPLETED	FELONY	N.Y. POLICE DEPT	BRONX	44	INSIDE	BAR/NIGHT CLUB	40.82885	-73.91666	2015	12	31	Thursday	23	2015-12-31 23:45:00	NA	NA	No
12/31/2015	23:36:00			12/31/2015	MURDER & NON-NEGL. MANSLAUGHTER		COMPLETED	FELONY	N.Y. POLICE DEPT	QUEENS	103	OUTSIDE		40.69734	-73.78456	2015	12	31	Thursday	23	2015-12-31 23:36:00	NA	NA	No
12/31/2015	23:30:00			12/31/2015	DANGEROUS DRUGS	CONTROLLED SUBSTANCE,INTENT TO	COMPLETED	FELONY	N.Y. POLICE DEPT	MANHATTAN	28		OTHER	40.80261	-73.94505	2015	12	31	Thursday	23	2015-12-31 23:30:00	NA	NA	No
12/31/2015	23:30:00			12/31/2015	ASSAULT 3 & RELATED OFFENSES	ASSAULT 3	COMPLETED	MISDEMEANOR	N.Y. POLICE DEPT	QUEENS	105	INSIDE	RESIDENCE-HOUSE	40.65455	-73.72634	2015	12	31	Thursday	23	2015-12-31 23:30:00	NA	NA	No
12/31/2015	23:25:00	12/31/2015	23:30:00	12/31/2015	ASSAULT 3 & RELATED OFFENSES	ASSAULT 3	COMPLETED	MISDEMEANOR	N.Y. POLICE DEPT	MANHATTAN	13	FRONT OF	OTHER	40.73800	-73.98789	2015	12	31	Thursday	23	2015-12-31 23:25:00	2015-12-31 23:30:00	0.08 hours	No
12/31/2015	23:18:00	12/31/2015	23:25:00	12/31/2015	FELONY ASSAULT	ASSAULT 2,1,UNCLASSIFIED	ATTEMPTED	FELONY	N.Y. POLICE DEPT	BROOKLYN	71	FRONT OF	DRUG STORE	40.66502	-73.95711	2015	12	31	Thursday	23	2015-12-31 23:18:00	2015-12-31 23:25:00	0.12 hours	No
12/31/2015	23:15:00			12/31/2015	DANGEROUS DRUGS	CONTROLLED SUBSTANCE, POSSESSI	COMPLETED	MISDEMEANOR	N.Y. POLICE DEPT	MANHATTAN	7	OPPOSITE OF	STREET	40.72020	-73.98874	2015	12	31	Thursday	23	2015-12-31 23:15:00	NA	NA	No
12/31/2015	23:15:00	12/31/2015	23:15:00	12/31/2015	DANGEROUS WEAPONS	WEAPONS POSSESSION 1 & 2	COMPLETED	FELONY	N.Y. POLICE DEPT	BRONX	46	FRONT OF	STREET	40.84571	-73.91040	2015	12	31	Thursday	23	2015-12-31 23:15:00	2015-12-31 23:15:00	0.00 hours	No
12/31/2015	23:15:00	12/31/2015	23:30:00	12/31/2015	ASSAULT 3 & RELATED OFFENSES	ASSAULT 3	COMPLETED	MISDEMEANOR	N.Y. POLICE DEPT	BRONX	48	INSIDE	RESIDENCE - APT. HOUSE	40.85671	-73.89190	2015	12	31	Thursday	23	2015-12-31 23:15:00	2015-12-31 23:30:00	0.25 hours	No
12/31/2015	23:10:00	12/31/2015	23:10:00	12/31/2015	PETIT LARCENY	LARCENY,PETIT FROM BUILDING,UN	COMPLETED	MISDEMEANOR	N.Y. POLICE DEPT	MANHATTAN	19	INSIDE	DRUG STORE	40.76562	-73.96362	2015	12	31	Thursday	23	2015-12-31 23:10:00	2015-12-31 23:10:00	0.00 hours	No

The following table contains the crime totals and demographic information for the five New York City boroughs and for the city as a whole.

borough	total_crime	land_area	population	population_percent	white	white_percent	black	black_percent	asian	asian_percent	other_or_mixed	other_or_mixed_percent	unemployment_rate	poverty_rate	median_age	median_household_income	median_property_value
BRONX	208966	110.0	1385108	0.1694	386497	0.2790374	505200	0.3647369	50897	0.0367459	442514	0.3194798	0.961	0.304	33.6	35176	368500
BROOKLYN	288804	180.0	2504700	0.3064	1072041	0.4280117	860083	0.3433876	263519	0.1052098	309057	0.1233908	0.825	0.223	34.7	51141	638500
MANHATTAN	223676	59.1	1585873	0.1940	911073	0.5744930	246687	0.1555528	180425	0.1137701	247688	0.1561840	0.659	0.176	36.8	75575	867600
QUEENS	193068	280.0	2230722	0.2729	886053	0.3972046	426683	0.1912757	513317	0.2301125	404669	0.1814072	0.683	0.138	38.1	60422	487400
STATEN ISLAND	44425	152.0	468730	0.0573	341677	0.7289420	49857	0.1063661	35377	0.0754742	41819	0.0892177	0.129	0.144	39.8	71622	457700
Whole New York	958939	781.1	8175133	1.0000	3597341	0.4400346	2088510	0.2554711	1043535	0.1276475	1445747	0.1768469	0.658	0.200	36.0	55752	538300
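
Raw totals favour the most populous boroughs, so a per-resident rate is a fairer basis for comparing them. A small sketch using only the `total_crime` and `population` values from the table above (the per-1,000 scaling and the variable names are choices made here, not part of the original analysis):

```r
total_crime <- c("BRONX" = 208966, "BROOKLYN" = 288804, "MANHATTAN" = 223676,
                 "QUEENS" = 193068, "STATEN ISLAND" = 44425)
population  <- c("BRONX" = 1385108, "BROOKLYN" = 2504700, "MANHATTAN" = 1585873,
                 "QUEENS" = 2230722, "STATEN ISLAND" = 468730)

# Crimes reported over the two-year period per 1,000 residents.
crime_per_1000 <- round(total_crime / population * 1000, 1)
print(sort(crime_per_1000, decreasing = TRUE))
```

Because the counts cover both 2014 and 2015, halving each figure gives an approximate annual rate.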

Exploratory Data Analysis

We create tables, charts and maps to explore the New York Crimes dataset.

Date and Time

Year and Month

About 20,000 fewer crimes were reported in 2015 than in 2014.

nyc %>% 
  ggplot(aes(x = as.character(occurance_year))) + 
  geom_bar() +
  ggtitle("New York Crime in 2014 & 2015") + xlab("year") + ylab("number of crime") +
  geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.6)
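
The year-over-year change quoted above can be computed directly from the yearly counts rather than read off the chart. A sketch on a toy vector of years (the real input would be `nyc$occurance_year`; the other names are illustrative):

```r
# Toy data: six records in 2014 and four in 2015.
occurance_year <- c(rep(2014, 6), rep(2015, 4))

yearly <- table(occurance_year)            # number of records per year
n_2014 <- as.integer(yearly[["2014"]])
n_2015 <- as.integer(yearly[["2015"]])

change <- n_2015 - n_2014                  # absolute change
pct_change <- round(100 * change / n_2014, 1)  # change relative to 2014

print(c(change = change, pct_change = pct_change))
```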

Looking at the number of crimes that occurred each month from 2014 to 2015, we find that August has the highest number of crimes and February the lowest. There are more crimes in summer (May to October) than in winter (November to April). Every month of 2015 except December has slightly fewer crimes than the same month of 2014. Overall, the difference between 2014 and 2015 is small.

plot1 <- nyc %>% 
  ggplot(aes(x = as.factor(occurance_month))) + 
  geom_bar() +
  ggtitle("Monthly New York Crime") + xlab("Month") + ylab("number of crime")

plot2 <- nyc %>% 
  ggplot(aes(x = as.factor(occurance_month), fill = as.factor(occurance_year))) +
  geom_bar(position = "fill") +
  xlab("Month") + ylab("proportion") +
  scale_fill_discrete(guide = guide_legend(title = "year"))

grid.arrange(plot1, plot2, ncol = 2)
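
The summer and winter comparison relies on the split defined above (May to October versus November to April). A sketch of how that split can be tabulated, on toy month numbers standing in for `nyc$occurance_month`:

```r
# Toy month numbers (1 = January, 12 = December).
occurance_month <- c(1, 2, 5, 6, 7, 8, 8, 10, 11, 12)

# May through October counts as summer, everything else as winter.
season <- ifelse(occurance_month %in% 5:10, "summer", "winter")
season_counts <- table(season)
print(season_counts)
```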

Day

We plot the number of crimes on each day of the week. The following graph shows that Friday and Wednesday have more crimes than other days, and Sunday and Monday have fewer. Even criminals don’t want to work on Sunday and have difficulty starting to “work” on Monday.

nyc$occurance_weekdays <- factor(nyc$occurance_weekdays, levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))

plot1 <- nyc %>% 
  ggplot(aes(x = as.factor(occurance_weekdays))) +
  geom_bar() +
  ggtitle("New York Crime - weekdays") + xlab("Weekdays") + ylab("number of crime") +
  geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.6)

plot2 <- nyc %>% 
  ggplot(aes(x = as.factor(occurance_weekdays), fill = as.factor(occurance_year))) +
  geom_bar(position = "fill") +
  xlab("weekdays") + ylab("proportion") +
  scale_fill_discrete(guide = guide_legend(title = "year"))

grid.arrange(plot1, plot2, ncol = 2)

We all know the night hours are dangerous. The following density plot tells us that a larger share of crimes falls in the night hours on weekends than on weekdays. The good news is that a smaller share of crimes falls in the daytime on Saturday and Sunday than from Monday to Friday.

nyc %>%
  ggplot(aes(x = occurance_hour, fill = weekends)) +
  geom_density(alpha = 0.8)
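
A density plot shows the shape of each distribution but not how large the night-time share actually is. One way to put a number on it is sketched below on toy data. The definition of night as 22:00 to 03:59 is an assumption made for illustration, as are the variable names.

```r
toy <- data.frame(
  occurance_hour = c(23, 1, 14, 15, 2, 16, 9, 11, 13, 17),
  weekends = c("Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No", "No", "No")
)

# Assumed definition of night: 22:00 to 03:59.
toy$night <- toy$occurance_hour >= 22 | toy$occurance_hour < 4

# Share of records that fall at night, separately for weekends and weekdays.
night_share <- tapply(toy$night, toy$weekends, mean)
print(round(night_share, 3))
```

Comparing shares rather than counts keeps the two groups comparable even though weekdays cover five days and weekends only two.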

Hour

The following plot shows that the number of crimes is much higher in the afternoon and evening than in the morning. The fewest crimes occur around sunrise.

nyc %>% 
  ggplot(aes(x = as.factor(occurance_hour))) + 
  geom_bar() +
  ggtitle("New York Crime - hour") + xlab("hour") + ylab("number of crime") +
  geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.6)

Crime Occurrence

Time between Crime Occurrence and Reporting

The following plot shows the number of days between the date a crime occurred and the date it was reported to police. Almost all crimes in New York are reported within 10 days of occurrence. We also find that a large number of crimes are reported around 183 days or 360 days after their occurrence.

plot1 <- nyc %>% 
  ggplot(aes(diff_reported.occurance)) +
  geom_histogram(binwidth = 1) +
  ggtitle("Days between reported and occurrence time") + xlab("<=10 days") +
  scale_x_continuous(limits = c(-1,10))

diff1 <- nyc %>% 
  filter(diff_reported.occurance > 10) %>% 
  select(level_of_offense, offense_classification_description, borough, diff_reported.occurance)

plot2 <- diff1 %>% 
  ggplot(aes(diff_reported.occurance)) +
  geom_histogram(binwidth = 1) + 
  ggtitle("") + xlab(">10 days")

grid.arrange(plot1, plot2, ncol = 2)

We decide to use 183 days as a demarcation point. The following plots compare crimes reported 10 to 183 days after occurrence with crimes reported more than 183 days after occurrence. The first comparison is by level of offense. We find that, in general, the longer the gap between the occurrence date and the reporting date, the more serious the crime. In terms of specific offense levels, most violations are reported within 10 days and misdemeanors are generally reported within half a year, while felonies take longer to be reported.

# <=10 days
diff1.short <- nyc %>% 
  filter(diff_reported.occurance <= 10) %>% 
  select(level_of_offense,offense_classification_description,borough,diff_reported.occurance)
diff1.short.borough <- diff1.short %>%
  group_by(borough) %>% 
  summarise(n_borough = n())
borough.crime <- nyc %>% 
  group_by(borough) %>% 
  summarise(n_borough = n())
borough.crime.short <- cbind(diff1.short.borough,borough.crime)[,-3]
colnames(borough.crime.short) <- c("borough","short.reported","total")
borough.crime.short <- borough.crime.short %>% 
  mutate(short.percent = short.reported/total)
within10.borough <- borough.crime.short %>% 
  ggplot(aes(x = borough, y = short.percent)) + 
  geom_bar(stat = "identity" ) +
  ggtitle("time between occurrence and reporting <=10 (Borough)") + 
  xlab("Borough") + ylab("report percent of whole")

# 10<days<183
diff1.mid <- nyc %>% 
  filter(diff_reported.occurance > 10 & diff_reported.occurance < 183) %>% 
  select(level_of_offense,offense_classification_description,borough,diff_reported.occurance)

half.level <- diff1.mid %>% 
  group_by(level_of_offense) %>% 
  summarise(n_level = n()) %>% 
  ggplot(aes(x = level_of_offense, y = n_level)) + geom_bar(stat = "identity") +
  ggtitle("10< time between occurrence and reporting <183 (Level of crime)") + 
  xlab("Level of offense") + ylab("number of crime") +
  geom_text(aes(label = ..y..), vjust = -0.1)

half.description <- diff1.mid %>% 
  group_by(offense_classification_description) %>% 
  summarise(n_class = n()) %>% 
  arrange(desc(n_class)) %>% 
  head(n = 10) %>% 
  ggplot(aes(x = reorder(offense_classification_description,-n_class),y = n_class)) +
  geom_bar(stat = "identity") +
  ggtitle("10< time between occurrence and reporting <183 (offense classification - top 10)") + 
  xlab("offense classification") + ylab("number of crime") +
  geom_text(aes(label = ..y..), vjust = -0.6) + 
  coord_flip()
diff1.mid.borough <- diff1.mid %>% 
  group_by(borough) %>% 
  summarise(n_borough = n())
borough.crime <- nyc %>% 
  group_by(borough) %>% 
  summarise(n_borough = n())
borough.crime.mid <- cbind(diff1.mid.borough,borough.crime)[,-3]
colnames(borough.crime.mid) <- c("borough","mid.reported","total")
borough.crime.mid <- borough.crime.mid %>% 
  mutate(mid.percent = mid.reported/total)
half.borough <- borough.crime.mid %>% 
  ggplot(aes(x = borough, y = mid.percent)) +
  geom_bar(stat = "identity") +
  ggtitle("10< difference between report time and occurrence time <183 (Borough)") + 
  xlab("Borough") + ylab("report percent of whole")

# days>=183
diff1.long <- nyc %>% 
  filter(diff_reported.occurance >= 183) %>%
  select(level_of_offense,offense_classification_description,borough,diff_reported.occurance)

whole.level <- diff1.long %>% 
  group_by(level_of_offense) %>% 
  summarise(n_level = n()) %>% 
  ggplot(aes(x = level_of_offense, y = n_level)) +
  geom_bar(stat = "identity") +
  ggtitle("time between occurrence and reporting >=183 (Level of crime)") + 
  xlab("Level of offense") + ylab("number of crime") +
  geom_text(aes(label = ..y..), vjust = -0.1)

whole.description <- diff1.long %>% 
  group_by(offense_classification_description) %>% 
  summarise(n_class = n()) %>% 
  arrange(desc(n_class)) %>% 
  head(n = 10) %>% 
  ggplot(aes(x = reorder(offense_classification_description,-n_class), y = n_class)) +
  geom_bar(stat = "identity") +
  ggtitle("time between occurrence and reporting >=183 (offense classification - top 10)") + 
  xlab("offense classification") + ylab("number of crime") +
  geom_text(aes(label = ..y..), vjust = -0.6) +
  coord_flip()

diff1.long.borough <- diff1.long %>% 
  group_by(borough) %>% 
  summarise(n_borough = n())
borough.crime <- nyc %>% 
  group_by(borough) %>% 
  summarise(n_borough = n())
borough.crime.long <- cbind(diff1.long.borough,borough.crime)[,-3]
colnames(borough.crime.long) <- c("borough","long.reported","total")
borough.crime.long <- borough.crime.long %>% 
  mutate(long.percent = long.reported/total)
whole.borough <- borough.crime.long %>% 
  ggplot(aes(x = borough,y = long.percent)) +
  geom_bar(stat = "identity") +
  ggtitle("time between occurrence and reporting >=183 (Borough)") + 
  xlab("Borough") + ylab("report percent of whole ")

# comparison
grid.arrange(half.level,whole.level)

We then compare crime classifications. Crimes reported 10 to 183 days after occurrence have a similar mix of classifications to crimes reported more than 183 days after occurrence.

grid.arrange(half.description,whole.description)

We also find that in all five boroughs of New York, more than 90% of crimes are reported within 10 days of occurrence.

grid.arrange(within10.borough,half.borough,whole.borough)
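
The three `cbind` steps above assume that every borough appears in every delay bucket and in the same row order; a borough with no records in one bucket would misalign the rows. A more compact alternative, sketched here on toy data, assigns each record to a bucket with `cut` and computes all shares in one cross-tabulation. The names `toy`, `bucket` and `shares` are illustrative, and the breaks assume delays measured in whole days.

```r
toy <- data.frame(
  borough = c("BRONX", "BRONX", "BRONX", "QUEENS", "QUEENS", "QUEENS", "QUEENS"),
  days = c(0, 2, 200, 1, 15, 0, 400)
)

# Whole-day delays: <=10, 11-182, and >=183, matching the filters used above.
toy$bucket <- cut(toy$days, breaks = c(-Inf, 10, 182, Inf),
                  labels = c("<=10", "11-182", ">=183"))

# Row-wise proportions: each borough's reports split across the three buckets.
shares <- prop.table(table(toy$borough, toy$bucket), margin = 1)
print(round(shares, 2))
```

Buckets with no records show up as zeros instead of dropping out, so every borough keeps one row with three shares that sum to 1.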

Time between crime occurrence and ending

Most crimes end within one day of their start.

nyc %>% ggplot(aes(diff_ending.occurance)) +
  geom_histogram() +
  scale_x_continuous(limits = c(-10,100))
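
Records without an ending date, which appear as NA in the data preview, drop out of the histogram, so the statement about one day applies only to records that have an end time. A sketch of how both shares could be computed, on toy durations in hours (all names and values are illustrative):

```r
# Toy durations in hours; NA marks records with no recorded end time.
diff_hours <- c(0.08, 0.12, NA, 0, 0.25, 30, NA, 72, 1.5, 0)

has_end <- !is.na(diff_hours)

# Share of all records that have an end time at all.
share_with_end <- mean(has_end)

# Among records with an end time, share that end within 24 hours.
share_within_day <- mean(diff_hours[has_end] <= 24)

print(c(share_with_end = share_with_end, share_within_day = share_within_day))
```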

We take a closer look at the offense levels of crimes with more than 24 hours between occurrence and ending. The plot below shows that there are more misdemeanors than felonies, and many more felonies than violations. Overall, more serious crimes take longer to commit.

diff2 <- nyc %>% 
  filter(diff_ending.occurance > 24) %>% 
  select(level_of_offense,offense_classification_description,borough,diff_ending.occurance)

diff2 %>% 
  group_by(level_of_offense) %>% 
  summarise(n_level = n()) %>% 
  ggplot(aes(x = level_of_offense,y = n_level)) + geom_bar(stat = "identity") +
  ggtitle("difference between ending and occurrence time >1 day (Level of crime)") + 
  xlab("Level of offense") + ylab("number of crime") +
  geom_text(aes(label = ..y..), vjust = -0.6)

The following bar chart shows the classification of crimes that have more than 24 hours between occurrence and ending.

diff2 %>% 
  group_by(offense_classification_description) %>% summarise(n_class = n()) %>% 
  arrange(desc(n_class)) %>% 
  head(n = 10) %>% 
  ggplot(aes(x = reorder(offense_classification_description, -n_class), y = n_class)) + 
  geom_bar(stat = "identity") +
  ggtitle("difference between ending and occurrence time >1 day (offense classification - top 10)") + 
  xlab("offense classification") + ylab("number of crime") +
  geom_text(aes(label = ..y..), vjust = -0.6) +
  coord_flip()

We also find that among the five boroughs of New York, Brooklyn has the largest number of crimes that last for more than 24 hours.

diff2.borough <- diff2 %>% 
  group_by(borough) %>% 
  summarise(n_borough = n())
diff2.borough %>% 
  ggplot(aes(x = borough,y = n_borough)) +
  geom_bar(stat = "identity") +
  ggtitle("difference between ending and occurrence time>1 day (Borough)") + 
  xlab("Borough") + ylab("number of crime") +
  geom_text(aes(label = ..y..), vjust = -0.6)

Borough

The plot below shows the proportion of races for the five boroughs of New York City.

bca %>%
  gather(key,value, white_percent, black_percent, asian_percent, other_or_mixed_percent) %>%
  ggplot(aes(x = borough, y = value, fill = key)) +
  geom_bar(position = "fill", stat = "identity") +
  ggtitle("Proportion of races (Borough)") + 
  xlab("Borough") + ylab("proportion")

We plot the median age of the five boroughs in the following bar chart. Staten Island has the highest median age among the five.

bca %>%
  ggplot(aes(x = borough, y = median_age)) +
  geom_bar(stat = "identity")

By plotting the unemployment rate and poverty rate on the same graph, we find that even though Staten Island has the lowest unemployment rate, its poverty rate isn’t the lowest. Queens has the lowest poverty rate.

bca %>%
  gather(key,value, unemployment_rate, poverty_rate) %>%
  ggplot(aes(x = borough, y = value, col = key)) +
  geom_point(size = 5, alpha = 0.8)

When it comes to median household income and median property value, we can see that in general people in Manhattan have higher incomes and higher property values.

ggplot(bca, aes(x = median_household_income, y = median_property_value, col = borough)) +
  geom_point(size = 5)
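
With only five boroughs, any relationship between demographics and crime is suggestive at best. As a rough illustration, the sketch below ranks the boroughs by crimes per resident and by poverty rate, using values from the borough table, and computes a Spearman rank correlation. The choice of statistic and the variable names are assumptions made here, and five points are far too few to support a conclusion.

```r
borough <- c("BRONX", "BROOKLYN", "MANHATTAN", "QUEENS", "STATEN ISLAND")
total_crime <- c(208966, 288804, 223676, 193068, 44425)
population <- c(1385108, 2504700, 1585873, 2230722, 468730)
poverty_rate <- c(0.304, 0.223, 0.176, 0.138, 0.144)

# Crimes reported in 2014-2015 per resident.
crime_per_resident <- total_crime / population

# Rank 1 is the lowest value in each column.
ranks <- data.frame(
  borough,
  rank_crime = rank(crime_per_resident),
  rank_poverty = rank(poverty_rate)
)
print(ranks)

# Spearman correlation compares the two rankings.
rho <- cor(crime_per_resident, poverty_rate, method = "spearman")
print(rho)
```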

We compare crimes between the five boroughs of New York City. The plot below shows the number of crimes at each offense level and their proportions in the five boroughs. Felonies make up a higher proportion of crimes in Queens than in the other boroughs, and violations make up a higher proportion in Staten Island.

plot1 <- nyc %>%
  ggplot(aes(x = borough, fill = level_of_offense)) +
  geom_bar(position = "dodge") + 
  guides(fill = "none")  # hide the legend; plot2 shows the same legend

plot2 <- nyc %>%
  ggplot(aes(x = borough, fill = level_of_offense)) +
  geom_bar(position = "fill")

grid.arrange(plot1, plot2, ncol = 2)

The bar chart below shows Brooklyn has the largest number of crimes among the five New York boroughs, so we focus on Brooklyn in this section.

borough.crime <- nyc %>% 
  group_by(borough) %>% 
  summarise(n_borough = n())
borough.crime %>% 
  ggplot(aes(x = borough, y = n_borough)) +
  geom_bar(stat = "identity") +
  ggtitle("Number of crime in different borough") + 
  xlab("Borough") + ylab("number of crime") +
  geom_text(aes(label = ..y..), vjust = -0.6)

Brooklyn

Looking at the top 10 crime types in Brooklyn, we find that petit larceny is the most common crime.

brooklyn <- nyc %>% 
  filter(borough == "BROOKLYN") %>% select(level_of_offense,offense_classification_description,type_of_jurisdiction,type_of_location)
brooklyn %>% 
  group_by(offense_classification_description) %>% 
  summarise(n_class = n()) %>% 
  arrange(desc(n_class)) %>% 
  head(n = 10) %>% 
  ggplot(aes(x = reorder(offense_classification_description, -n_class), y = n_class)) +
  geom_bar(stat = "identity") +
  ggtitle("top 10 crime in brooklyn") + 
  xlab("offense classification") + ylab("number of crime") +
  geom_text(aes(label = ..y..), vjust = -0.6) + 
  coord_flip()

When it comes to level of offense, Brooklyn has more misdemeanors than felonies and violations combined.

brooklyn %>% 
  group_by(level_of_offense) %>% 
  summarise(n_level = n()) %>% 
  ggplot(aes(x = level_of_offense,y = n_level)) + 
  geom_bar(stat = "identity") +
  ggtitle("crime in brooklyn group by different level of crime") + 
  xlab("Level of offense") + ylab("number of crime") +
  geom_text(aes(label = ..y..), vjust = -0.6)

If we look at jurisdiction, we find that most crimes fall under the direct jurisdiction of the New York police department, the New York housing police and the New York transit police.

brooklyn %>% 
  group_by(type_of_jurisdiction) %>% 
  summarise(n_jurisdiction = n()) %>% 
  ggplot(aes(x = type_of_jurisdiction, y = n_jurisdiction)) +
  geom_bar(stat = "identity") +
  ggtitle("crime in brooklyn group by different type of jurisdiction") + 
  xlab("type of jurisdiction") + ylab("number of crime") +
  geom_text(aes(label = ..y..), vjust = -0.6) +
  coord_flip()

We plot the frequency of crime occurrence by type of location. It shows that crimes in Brooklyn most often happen on the street, in commercial buildings, and in residential houses.

brooklyn %>% 
  group_by(type_of_location) %>% 
  summarise(n_location = n()) %>%
  arrange(desc(n_location)) %>% 
  head(n = 6) %>% 
  ggplot(aes(x = type_of_location, y = n_location)) + 
  geom_bar(stat = "identity") +
  ggtitle("crime in brooklyn group by different type of location") + 
  xlab("type of location") + ylab("number of crime") +
  geom_text(aes(label = ..y..), vjust = -0.6) + 
  coord_flip()

Type and Level

We plot the top 5 crime types in New York. They are petit larceny, harassment 2, assault 3 & related offenses, criminal mischief & related offenses (truncated to “CRIMINAL MISCHIEF & RELATED OF” in the data), and grand larceny.

nyc %>% 
  group_by(offense_classification_description) %>% 
  summarise(n_class = n()) %>% 
  arrange(desc(n_class)) %>% 
  head(n = 5) %>% 
  ggplot(aes(x = reorder(offense_classification_description, -n_class), y = n_class)) +
  geom_bar(stat = "identity") +
  ggtitle("Top 5 crime type in New York") + 
  xlab("offense classification") + ylab("number of crime") +
  geom_text(aes(label = ..y..), vjust = -0.6) +
  coord_flip()
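
The group, count, sort and truncate pattern behind these top-N charts appears several times in this report. As a quick cross-check of which categories make the cut, the same ranking can be sketched in base R with `table`, `sort` and `head`. The toy vector below stands in for `nyc$offense_classification_description`, and `counts` and `top2` are illustrative names.

```r
offense <- c("PETIT LARCENY", "HARRASSMENT 2", "PETIT LARCENY", "GRAND LARCENY",
             "PETIT LARCENY", "HARRASSMENT 2", "ROBBERY")

# Count each category, then order from most to least frequent.
counts <- sort(table(offense), decreasing = TRUE)

# Keep the two most frequent categories.
top2 <- head(counts, 2)
print(top2)
```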

We then color code the chart above with crime offense level. All grand larcenies are felonies.

top5class <- nyc %>% 
  filter(offense_classification_description %in% c("PETIT LARCENY","HARRASSMENT 2","ASSAULT 3 & RELATED OFFENSES","CRIMINAL MISCHIEF & RELATED OF","GRAND LARCENY")) %>% 
  select(offense_classification_description,level_of_offense,borough,type_of_location,occurance_month,occurance_day,occurance_weekdays,occurance_hour)

top5class %>% 
  ggplot(aes(x = as.factor(offense_classification_description), fill = level_of_offense)) +
  geom_bar() +
  ggtitle("level of offense of the 5 most frequent crimes") +
  xlab("type of crime") + ylab("number of crime") +
  coord_flip()

Plotting the number of occurrences of the top 5 crime types in each of the five boroughs, we find that Manhattan has a serious petit larceny problem, and Staten Island has very few grand larcenies.

top5class %>% 
  ggplot(aes(x = as.factor(offense_classification_description), fill = borough)) +
  geom_bar(position = "dodge") +
  ggtitle("New York Top 5 Crime Types in boroughs") +
  xlab("type of crime") + ylab("number of crime") +
  coord_flip()

The following table shows the 4 types of location where crimes occurred most often, with the number of crimes at each in 2014-2015.

loc <- nyc %>% 
  group_by(type_of_location) %>% 
  summarise(n_location = n()) %>% 
  arrange(desc(n_location)) %>% 
  head(n = 4)
kable(loc)

type_of_location	n_location
STREET	295257
RESIDENCE - APT. HOUSE	207848
RESIDENCE-HOUSE	88019
RESIDENCE - PUBLIC HOUSING	72818

We create a bar chart for the table above and color-code it by the top 5 crime types. Sadly, we find that most harassment 2 and assault 3 crimes happen in residences.

top5class %>% 
  filter(type_of_location %in% c("STREET","RESIDENCE - APT. HOUSE","RESIDENCE-HOUSE","RESIDENCE - PUBLIC HOUSING")) %>% 
  ggplot(aes(x = type_of_location, fill = offense_classification_description)) +
  geom_bar(position = "dodge") +
  ggtitle("Top 5 crime types in the 4 most frequent crime locations") +
  xlab("location") + ylab("number of crime") +
  coord_flip()
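
The location names in the filter above are typed by hand from the table. As a hedged alternative, they could be derived from the counts so that the filter follows the data. The sketch below uses a made-up data frame `toy` in place of `nyc`; `top_locations` and `toy_top` are illustrative names, and the toy example keeps 2 locations instead of 4.

```r
library(dplyr)

# Toy stand-in for `nyc`: same column name as the report, made-up rows.
toy <- data.frame(
  type_of_location = c("STREET", "STREET", "STREET",
                       "RESIDENCE-HOUSE", "RESIDENCE-HOUSE", "BAR/NIGHT CLUB"),
  stringsAsFactors = FALSE
)

# Names of the most frequent location types, most frequent first.
top_locations <- toy %>%
  count(type_of_location, sort = TRUE) %>%
  head(n = 2) %>%               # the report uses the top 4
  pull(type_of_location)

# Keep only the records at those locations.
toy_top <- toy %>%
  filter(type_of_location %in% top_locations)
```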

According to the four plots below, the top 5 crime types are most likely to occur from May to October, on the first day of each month, on Fridays, and in the afternoon. The pattern is similar to the pattern for all crimes.

top5class %>%
  ggplot(aes(x = as.factor(occurance_month), fill = offense_classification_description)) +
  geom_bar() +
  ggtitle("Monthly New York Crime") +
  xlab("Month") + ylab("number of crime") +
  scale_fill_discrete(guide = guide_legend(title = "crime type"))  # legend shows crime type, not year

top5class %>% 
  ggplot(aes(x = as.factor(occurance_day), fill = offense_classification_description)) +
  geom_bar() +
  ggtitle("Daily New York Crime") + 
  xlab("Day") + ylab("number of crime") +
  scale_fill_discrete(guide = guide_legend(title = "crime type"))  # legend shows crime type, not year

top5class %>%
  ggplot(aes(x = as.factor(occurance_weekdays), fill = offense_classification_description)) +
  geom_bar() +
  ggtitle("Weekly New York Crime") +
  xlab("Weekdays") + ylab("number of crime") +
  scale_fill_discrete(guide = guide_legend(title = "crime type"))  # legend shows crime type, not year

top5class %>% 
  ggplot(aes(x = as.factor(occurance_hour), fill = offense_classification_description)) +
  geom_bar() +
  ggtitle("Hourly New York Crime") +
  xlab("hour") + ylab("number of crime") +
  scale_fill_discrete(guide = guide_legend(title = "crime type"))  # legend shows crime type, not year
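
The four plots above differ only in the time column, the title and the x-axis label, so they could be produced by one small helper. This is a sketch under stated assumptions, not the authors' code: `plot_crime_by_time` is a hypothetical function name, `toy` is a made-up stand-in for `top5class`, and the `.data` pronoun requires ggplot2 3.0.0 or later.

```r
library(ggplot2)

# Hypothetical helper: stacked bar chart of crime counts over one time column,
# filled by crime type. `time_col` is the column name as a string.
plot_crime_by_time <- function(data, time_col, title, x_label) {
  ggplot(data, aes(x = as.factor(.data[[time_col]]),
                   fill = offense_classification_description)) +
    geom_bar() +
    ggtitle(title) +
    xlab(x_label) + ylab("number of crime") +
    scale_fill_discrete(guide = guide_legend(title = "crime type"))
}

# Toy stand-in for `top5class`: same column names as the report, made-up rows.
toy <- data.frame(
  occurance_month = c(1, 1, 7, 7, 7),
  occurance_hour = c(9, 15, 15, 22, 22),
  offense_classification_description = c("PETIT LARCENY", "GRAND LARCENY",
                                         "PETIT LARCENY", "HARRASSMENT 2",
                                         "PETIT LARCENY")
)

monthly <- plot_crime_by_time(toy, "occurance_month", "Monthly New York Crime", "Month")
hourly  <- plot_crime_by_time(toy, "occurance_hour", "Hourly New York Crime", "hour")
```

Printing `monthly` or `hourly` draws the chart. Keeping the legend title in one place also avoids the four copies drifting apart.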

Interactive Maps

We plot the crimes on the map of New York City. (To reduce loading time, we randomly sample 8000 records for each map.)

Crimes occurred in 2014 and 2015

The following map shows crimes that occurred in 2014. Each point represents one reported crime.

nyc2014 <- nyc %>% 
  filter(occurance_year == "2014")

crime.distribution2014 <- sample_n(nyc2014, 8e3)  # 8000 points

leaflet(data = crime.distribution2014) %>% 
  addProviderTiles("Esri.NatGeoWorldMap") %>%
  addCircleMarkers(~ longitude, ~latitude, radius = 0.005, color = "yellow", fillOpacity = 0.1)

The following map shows crimes that occurred in 2015. Each point represents one reported crime.

nyc2015 <- nyc %>% 
  filter(occurance_year == "2015")

crime.distribution2015 <- sample_n(nyc2015, 8e3)  # 8000 points

leaflet(data = crime.distribution2015) %>% 
  addProviderTiles("Esri.NatGeoWorldMap") %>%
  addCircleMarkers(~ longitude, ~latitude, radius = 0.005, color = "blue", fillOpacity = 0.1)
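
Because the points are drawn at random, each knit of the report shows a different sample. If a stable map is wanted, one option is to fix the random seed and drop records without coordinates before sampling. The sketch below is illustrative only: `toy` is a made-up stand-in for `nyc2014`, the seed value is arbitrary, and `crime_sample` is a hypothetical name.

```r
library(dplyr)

# Toy stand-in for `nyc2014`: 100 made-up points, 5 of them without coordinates.
toy <- data.frame(
  longitude = c(rep(NA, 5), seq(-74.05, -73.75, length.out = 95)),
  latitude  = c(rep(NA, 5), seq(40.55, 40.90, length.out = 95))
)

set.seed(2018)  # arbitrary seed so the same points are drawn on every run

crime_sample <- toy %>%
  filter(!is.na(longitude), !is.na(latitude)) %>%  # keep mappable records only
  sample_n(20)                                     # the report samples 8000
```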

Date and time of occurrence

Clicking a numbered cluster on the map below zooms in and shows more detail. When location pins appear, hover over a pin to see the date and time of the crime that occurred there. You can select which year to show in the top right-hand corner of the map.

nyc.s <- sample_n(nyc, 8000)   # randomly pick 8000 points
nyc.s <- na.omit(nyc.s)        # drop records with missing values
nyc.s.year <- split(nyc.s, nyc.s$occurance_year)  # split by year

l <- leaflet() %>% addTiles()

names(nyc.s.year) %>%
  purrr::walk( function(year) {
    l <<- l %>%
      addMarkers(data = nyc.s.year[[year]],
                 lng = ~longitude, lat = ~latitude,
                 label = ~as.character(occurance_date_time),
                 popup = ~as.character(occurance_date_time),
                 group = year,
                 clusterOptions = markerClusterOptions(removeOutsideVisibleBounds = FALSE),
                 labelOptions = labelOptions(noHide = F, direction = 'auto'))
  })

l %>%
  addLayersControl(
    overlayGroups = names(nyc.s.year),
    options = layersControlOptions(collapsed = FALSE)
  )
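
The layered map above is built by updating `l` from inside `purrr::walk` with `<<-`. One possible alternative that avoids the global assignment is to fold over the group names with `purrr::reduce`. This is a minimal sketch on made-up data: `toy_by_year` stands in for `nyc.s.year`, and `crime_map` is an illustrative name.

```r
library(leaflet)

# Toy stand-in for `nyc.s.year`: a named list with one data frame per year.
toy_by_year <- list(
  "2014" = data.frame(longitude = c(-73.98, -73.95), latitude = c(40.75, 40.68),
                      occurance_date_time = c("2014-03-01 14:00:00",
                                              "2014-07-04 22:30:00")),
  "2015" = data.frame(longitude = -73.90, latitude = 40.85,
                      occurance_date_time = "2015-01-01 00:15:00")
)

# Start from a tiled base map and add one clustered marker layer per year.
crime_map <- purrr::reduce(
  names(toy_by_year),
  function(map, year) {
    addMarkers(map, data = toy_by_year[[year]],
               lng = ~longitude, lat = ~latitude,
               label = ~as.character(occurance_date_time),
               group = year,
               clusterOptions = markerClusterOptions())
  },
  .init = addTiles(leaflet())
)

# Layer control so each year can be switched on and off.
crime_map <- addLayersControl(
  crime_map,
  overlayGroups = names(toy_by_year),
  options = layersControlOptions(collapsed = FALSE)
)
```

The same pattern would apply to the maps grouped by offense level and by borough, with only the list and the label column changed.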

Crime type

Clicking a numbered cluster on the map below zooms in and shows more detail. When location pins appear, hover over a pin to see the type of crime that occurred there. You can select which level of offense to show in the top right-hand corner of the map.

nyc.s <- sample_n(nyc, 8000)   # randomly pick 8000 points
nyc.s <- na.omit(nyc.s)        # drop records with missing values
nyc.s.level <- split(nyc.s, nyc.s$level_of_offense)  # split by level of offense

l <- leaflet() %>% addTiles()

names(nyc.s.level) %>%
  purrr::walk( function(level) {
    l <<- l %>%
      addMarkers(data = nyc.s.level[[level]],
                 lng = ~longitude, lat = ~latitude,
                 label = ~as.character(offense_classification_description),
                 popup = ~as.character(offense_classification_description),
                 group = level,
                 clusterOptions = markerClusterOptions(removeOutsideVisibleBounds = FALSE),
                 labelOptions = labelOptions(noHide = F, direction = 'auto'))
  })


l %>%
  addLayersControl(
    overlayGroups = names(nyc.s.level),
    options = layersControlOptions(collapsed = FALSE)
  )

Type of Location

Clicking a numbered cluster on the map below zooms in and shows more detail. When location pins appear, hover over a pin to see the type of location where the crime occurred. You can select which borough to show in the top right-hand corner of the map.

nyc.s <- sample_n(nyc, 8000)   # randomly pick 8000 points
nyc.s <- na.omit(nyc.s)        # drop records with missing values
nyc.s.borough <- split(nyc.s, nyc.s$borough)  # split by borough

l <- leaflet() %>% addTiles()

names(nyc.s.borough) %>%
  purrr::walk( function(borough) {
    l <<- l %>%
      addMarkers(data = nyc.s.borough[[borough]],
                 lng = ~longitude, lat = ~latitude,
                 label = ~as.character(type_of_location),
                 popup = ~as.character(type_of_location),
                 group = borough,
                 clusterOptions = markerClusterOptions(removeOutsideVisibleBounds = FALSE),
                 labelOptions = labelOptions(noHide = F, direction = 'auto'))
  })

l %>%
  addLayersControl(
    overlayGroups = names(nyc.s.borough),
    options = layersControlOptions(collapsed = FALSE)
  )

Summary

To uncover new information, we examine and classify all of the variables and add new information to the data set. We create multiple tables and charts to help understand the data and discover patterns. Our analysis yields the following findings, which may be useful to most people:

Time and Trend

The number of crimes in New York City decreased from 2014 to 2015.
There are more crimes in summer than in winter.
More crimes occurred on the later weekdays and on the first day of the month. Fewer crimes occurred at the end of the weekend and at the beginning of the week.
Night hours on weekends are more dangerous than night hours on weekdays.
The number of crimes is generally lower in the morning.

Crime Type and Level

The top 3 crime types in New York City are larceny, harassment and assault.
All grand larcenies are felonies.
A large number of harassments and assaults happen in residences.

Crime Occurrence

Felonies take longer to be reported to the police than misdemeanors and violations.
More than 90% of crimes are reported within 10 days of occurrence.
Most crimes end within 24 hours of beginning.
Overall, more serious crimes take longer to commit.

Borough

Brooklyn has the largest number of crimes among the five New York boroughs.
Staten Island has the lowest poverty rate and number of crimes.
Manhattan has a serious petit larceny problem.

Implications

This analysis can be used by individuals and organizations to gain an understanding of the crime situation in New York City.

Limitations

This analysis was limited by the time span of the data set and by the lack of data mining. With only two years of data, our yearly and monthly analysis might be biased, and we cannot tell whether crime in New York City follows a downward trend over a longer period. We collected demographic information for the five New York City boroughs, but a sample of five is too small for the data mining we had planned. Combining crime data from past decades with demographic information could produce a data set large enough to mine for the demographic factors that affect crime rates.

New York City Crimes Analysis

Pengzu Chen, Yiyin Li

April 22, 2018
