Introduction

Problem Statement

The Trump administration has made a priority of reducing violent crime. As the nation’s largest city, New York is often portrayed as violent in the press, on TV and in the movies. In this project, we analyze the crimes reported in all five boroughs of New York City from 2014 to 2015 to find out where and when crimes happen and which factors influence a borough’s crime rate.

Data Source and Methodology

This project explores the New York City Crimes dataset, which contains NYPD complaint data from 2014 to 2015. The dataset records each crime’s occurrence time, location, status, type and level of offense. We also collect demographic information (population, unemployment rate, poverty rate, median age and median household income) for New York City as a whole and for its five boroughs (Bronx, Brooklyn, Manhattan, Queens, Staten Island).

We prepare and clean the dataset, then conduct exploratory data analysis and visualize the data. The visualizations show when most crimes happen and what kinds of crimes they are. We also relate crime to demographic factors of the five boroughs of New York, and use interactive maps to show how crime varies between blocks.

Mission

This project is intended to give the public and organizations a better understanding of the crime situation in New York City. We hope this analysis helps residents decide when and where it’s safe to go out alone. Organizations can use the data to see which factors can be improved to make the community safer.

Packages Required

The following packages are used in order to produce results throughout this project.

library(tidyr)      # used for tidying up data
library(dplyr)      # used for data manipulation
library(lubridate)  # used for transforming date
library(knitr)      # used for viewing data
library(leaflet)    # used for creating interactive maps
library(ggplot2)    # used for data visualization
library(gridExtra)  # used for arranging grid-based plots

Data Preparation

We perform the following procedures to get the data ready for analysis.

Data Description

The New York City Crimes data we use is available on Kaggle as CSV files. The data is collected from NYC Open Data and contains crimes reported to the NYPD from 2014 to 2015. The following table lists each original column name, the easier-to-understand name we rename it to, and its description.

OldColumnName NewColumnName ColumnDescription
CMPLNT_NUM crime_id Randomly generated persistent ID for each complaint
CMPLNT_FR_DT occurance_date Exact date of occurrence for the reported event (or starting date of occurrence, if CMPLNT_TO_DT exists)
CMPLNT_FR_TM occurance_time Exact time of occurrence for the reported event (or starting time of occurrence, if CMPLNT_TO_TM exists)
CMPLNT_TO_DT ending_date Ending date of occurrence for the reported event, if exact time of occurrence is unknown
CMPLNT_TO_TM ending_time Ending time of occurrence for the reported event, if exact time of occurrence is unknown
RPT_DT reported_date Date event was reported to police
KY_CD offense_classification_code Three digit offense classification code
OFNS_DESC offense_classification_description Description of offense corresponding with key code
PD_CD internal_classification_code Three digit internal classification code (more granular than Key Code)
PD_DESC internal_classification_description Description of internal classification corresponding with PD code (more granular than Offense Description)
CRM_ATPT_CPTD_CD crime_status Indicator of whether crime was successfully completed or attempted, but failed or was interrupted prematurely
LAW_CAT_CD level_of_offense Level of offense: felony, misdemeanor, violation
JURIS_DESC type_of_jurisdiction Jurisdiction responsible for incident. Either internal, like Police, Transit, and Housing; or external, like Correction, Port Authority, etc.
BORO_NM borough The name of the borough in which the incident occurred
ADDR_PCT_CD precienct The precinct in which the incident occurred
LOC_OF_OCCUR_DESC specific_location Specific location of occurrence in or around the premises; inside, opposite of, front of, rear of
PREM_TYP_DESC type_of_location Specific description of premises; grocery store, residence, street, etc.
PARKS_NM park_name Name of NYC park, playground or greenspace of occurrence, if applicable (state parks are not included)
HADEVELOPT housing_name Name of NYCHA housing development of occurrence, if applicable
X_COORD_CD x_coordinate X-coordinate for New York State Plane Coordinate System, Long Island Zone, NAD 83, units feet (FIPS 3104)
Y_COORD_CD y_coordinate Y-coordinate for New York State Plane Coordinate System, Long Island Zone, NAD 83, units feet (FIPS 3104)
Latitude latitude Latitude coordinate for Global Coordinate System, WGS 1984, decimal degrees (EPSG 4326)
Longitude longitude Longitude coordinate for Global Coordinate System, WGS 1984, decimal degrees (EPSG 4326)
Lat_Lon location Latitude and longitude as a single coordinate pair

We also collect demographic information for the five New York City boroughs from the following websites and put it in a new table:

  • YCharts: unemployment rate
  • Wikipedia: population, land area
  • DataUSA: poverty rate, median age, median household income, median property value

Import and Clean Data

We run the following code to import the dataset downloaded from Kaggle into R and rename every column to an easier-to-understand name. After inspecting the source data, we remove the “crime_id”, “offense_classification_code”, “internal_classification_code”, “x_coordinate”, “y_coordinate” and “location” columns because they are redundant or unwanted.

nyc <- read.csv("NYPD_Complaint_Data_Historic.csv")
colnames(nyc) <- c("crime_id","occurance_date","occurance_time","ending_date","ending_time","reported_date","offense_classification_code","offense_classification_description","internal_classification_code","internal_classification_description","crime_status","level_of_offense","type_of_jurisdiction","borough","precienct","specific_location","type_of_location","park_name","housing_name","x_coordinate","y_coordinate","latitude","longitude","location")

nyc <- nyc[,-c(1,7,9,20,21,24)]
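
Dropping columns by position is brittle if the column order of the CSV ever changes. The following is a minimal sketch of the same step done by name, using a small made-up data frame (`toy`, `drop_cols` and the values shown are illustrative assumptions, not taken from the dataset):

```r
# Toy stand-in for the renamed data; the values are invented for illustration.
toy <- data.frame(
  crime_id = c(1, 2),
  occurance_date = c("12/31/2015", "12/30/2015"),
  offense_classification_code = c(113, 101),
  borough = c("BRONX", "QUEENS"),
  x_coordinate = c(1007314, 1043991),
  stringsAsFactors = FALSE
)

# Columns to discard, named explicitly instead of by index.
drop_cols <- c("crime_id", "offense_classification_code",
               "internal_classification_code",
               "x_coordinate", "y_coordinate", "location")

# Keep every column whose name is not in drop_cols; names that are
# absent from the data frame are simply ignored.
toy_clean <- toy[, !(names(toy) %in% drop_cols), drop = FALSE]
print(names(toy_clean))
```

The same pattern should apply to `nyc` once its columns have been renamed.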

We then split “occurance_date” into “occurance_year”, “occurance_month”, “occurance_day” and “occurance_weekdays”, and derive “occurance_hour” from “occurance_time”. The duration of a crime is calculated as “diff_ending.occurance” (in hours), and “diff_reported.occurance” is the number of days between the occurrence of the crime and its report to the NYPD. A “weekends” flag marks crimes that occurred on a Saturday or Sunday. The dataset contains a few records from before 2014, and we remove them.

nyc <- nyc %>%
  mutate(occurance_year = year(mdy(occurance_date)),
         occurance_month = month(mdy(occurance_date)),
         occurance_day = day(mdy(occurance_date)),
         occurance_weekdays = weekdays(mdy(occurance_date)),
         occurance_hour = hour(hms(occurance_time)),
         diff_reported.occurance = difftime(mdy(reported_date),mdy(occurance_date),units = "day"),
         occurance_date_time = as.POSIXct(paste(occurance_date,occurance_time),format = "%m/%d/%Y %H:%M:%S"),
         ending_date_time = as.POSIXct(paste(ending_date,ending_time),format = "%m/%d/%Y %H:%M:%S"),
         diff_ending.occurance = round(difftime(ending_date_time,occurance_date_time,units = "hours"),digits = 2),
         weekends = ifelse(occurance_weekdays %in% c("Sunday","Saturday"),"Yes","No")
         ) %>% 
  filter(occurance_year %in% c(2014, 2015))
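
To see what these derived columns contain, here is a small sketch on two made-up records. It uses only base R rather than lubridate, so it is an alternative illustration and not the pipeline above; the `toy` rows are invented. The weekend flag is computed from `as.POSIXlt()$wday` because `weekdays()` returns day names in the current locale.

```r
# Two invented complaints in the same date and time format as the source data.
toy <- data.frame(
  occurance_date = c("12/31/2015", "01/03/2015"),
  occurance_time = c("23:45:00", "08:10:00"),
  reported_date  = c("12/31/2015", "01/10/2015"),
  stringsAsFactors = FALSE
)

occ_date <- as.Date(toy$occurance_date, format = "%m/%d/%Y")
rep_date <- as.Date(toy$reported_date, format = "%m/%d/%Y")

# Date components.
toy$occurance_year  <- as.integer(format(occ_date, "%Y"))
toy$occurance_month <- as.integer(format(occ_date, "%m"))
toy$occurance_hour  <- as.integer(substr(toy$occurance_time, 1, 2))

# Whole days between occurrence and report.
toy$diff_reported.occurance <-
  as.numeric(difftime(rep_date, occ_date, units = "days"))

# wday is 0 for Sunday and 6 for Saturday, independent of locale.
toy$weekends <- ifelse(as.POSIXlt(occ_date)$wday %in% c(0, 6), "Yes", "No")

print(toy)
```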

We calculate the total number of crimes reported for each borough of New York City as “total_crime” in a new table.

table2 <- nyc %>% 
  group_by(borough) %>% 
  summarise(total_crime = length(occurance_date))
table2$total_crime <- as.numeric(table2$total_crime)

We then add the following information for each of the five boroughs:

  • Land area (in square kilometers)
  • Population of the borough and its share of the New York City total
  • White, Black, Asian and other (or mixed) population of the borough and each group’s share of the borough’s population
  • Unemployment rate
  • Poverty rate
  • Median age
  • Median household income (in US dollars)
  • Median property value (in US dollars)

total_poplutation <- rep(8175133,5)
population <- c(1385108,2504700,1585873,2230722,468730)
population_percent <- round(population/total_poplutation,digits = 4)
land_area <- c(110,180,59.1,280,152)
white <- c(386497,1072041,911073,886053,341677)
black <- c(505200,860083,246687,426683,49857)
asian <- c(50897,263519,180425,513317,35377)
other_or_mixed <- c(442514,309057,247688,404669,41819)
white_percent <- white/population
black_percent <- black/population
asian_percent <- asian/population
other_or_mixed_percent <- other_or_mixed/population
unemployment_rate <- c(0.961,0.825,0.659,0.683,0.129)
poverty_rate <- c(0.304,0.223,0.176,0.138,0.144)
median_age <- c(33.6,34.7,36.8,38.1,39.8)
median_household_income <- c(35176,51141,75575,60422,71622)
median_property_value <- c(368500,638500,867600,487400,457700)

table3 <- cbind(table2, land_area, population, population_percent, white, white_percent, black, black_percent, asian, asian_percent, other_or_mixed, other_or_mixed_percent, unemployment_rate, poverty_rate, median_age, median_household_income, median_property_value)

newrow1 <- data.frame(borough = "Whole New York",
                    total_crime = sum(table3$total_crime),
                    land_area = sum(table3$land_area),
                    population = sum(table3$population),
                    population_percent = sum(population_percent),
                    white = sum(table3$white),
                    white_percent = sum(table3$white)/total_poplutation[1],
                    black = sum(table3$black),
                    black_percent = sum(table3$black)/total_poplutation[1],
                    asian = sum(table3$asian),
                    asian_percent = sum(table3$asian)/total_poplutation[1],
                    other_or_mixed = sum(table3$other_or_mixed),
                    other_or_mixed_percent = sum(table3$other_or_mixed)/total_poplutation[1],
                    unemployment_rate = 0.658,
                    poverty_rate = 0.20,
                    median_age = 36,
                    median_household_income = 55752,median_property_value = 538300)  

#borough crime analysis table
bca <- rbind(table3,newrow1)
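
Because these figures are typed in by hand, a few consistency checks are cheap insurance. The sketch below re-enters the population vectors from the code above and verifies that the borough populations add up to the city total and that the four race groups add up to each borough’s population. The checks are an optional addition for illustration, and `city_total`, `race_total` and `race_share` are names introduced here.

```r
# Values copied from the vectors defined above
# (order: Bronx, Brooklyn, Manhattan, Queens, Staten Island).
population     <- c(1385108, 2504700, 1585873, 2230722, 468730)
white          <- c(386497, 1072041, 911073, 886053, 341677)
black          <- c(505200, 860083, 246687, 426683, 49857)
asian          <- c(50897, 263519, 180425, 513317, 35377)
other_or_mixed <- c(442514, 309057, 247688, 404669, 41819)

# Borough populations should add up to the city-wide figure.
city_total <- 8175133
stopifnot(sum(population) == city_total)

# The four race groups should add up to each borough's population.
race_total <- white + black + asian + other_or_mixed
stopifnot(all(race_total == population))

# Consequently the four shares in each borough should sum to 1.
race_share <- cbind(white, black, asian, other_or_mixed) / population
stopifnot(all(abs(rowSums(race_share) - 1) < 1e-12))
```

`stopifnot()` stays silent when every check passes and stops with an error naming the failed condition otherwise.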

Final Data Preview

The following table is a snapshot of the cleaned dataset that contains all crime records.

(first 10 rows shown)
occurance_date occurance_time ending_date ending_time reported_date offense_classification_description internal_classification_description crime_status level_of_offense type_of_jurisdiction borough precienct specific_location type_of_location park_name housing_name latitude longitude occurance_year occurance_month occurance_day occurance_weekdays occurance_hour diff_reported.occurance occurance_date_time ending_date_time diff_ending.occurance weekends
12/31/2015 23:45:00 12/31/2015 FORGERY FORGERY,ETC.,UNCLASSIFIED-FELO COMPLETED FELONY N.Y. POLICE DEPT BRONX 44 INSIDE BAR/NIGHT CLUB 40.82885 -73.91666 2015 12 31 Thursday 23 0 days 2015-12-31 23:45:00 NA NA No
12/31/2015 23:36:00 12/31/2015 MURDER & NON-NEGL. MANSLAUGHTER COMPLETED FELONY N.Y. POLICE DEPT QUEENS 103 OUTSIDE 40.69734 -73.78456 2015 12 31 Thursday 23 0 days 2015-12-31 23:36:00 NA NA No
12/31/2015 23:30:00 12/31/2015 DANGEROUS DRUGS CONTROLLED SUBSTANCE,INTENT TO COMPLETED FELONY N.Y. POLICE DEPT MANHATTAN 28 OTHER 40.80261 -73.94505 2015 12 31 Thursday 23 0 days 2015-12-31 23:30:00 NA NA No
12/31/2015 23:30:00 12/31/2015 ASSAULT 3 & RELATED OFFENSES ASSAULT 3 COMPLETED MISDEMEANOR N.Y. POLICE DEPT QUEENS 105 INSIDE RESIDENCE-HOUSE 40.65455 -73.72634 2015 12 31 Thursday 23 0 days 2015-12-31 23:30:00 NA NA No
12/31/2015 23:25:00 12/31/2015 23:30:00 12/31/2015 ASSAULT 3 & RELATED OFFENSES ASSAULT 3 COMPLETED MISDEMEANOR N.Y. POLICE DEPT MANHATTAN 13 FRONT OF OTHER 40.73800 -73.98789 2015 12 31 Thursday 23 0 days 2015-12-31 23:25:00 2015-12-31 23:30:00 0.08 hours No
12/31/2015 23:18:00 12/31/2015 23:25:00 12/31/2015 FELONY ASSAULT ASSAULT 2,1,UNCLASSIFIED ATTEMPTED FELONY N.Y. POLICE DEPT BROOKLYN 71 FRONT OF DRUG STORE 40.66502 -73.95711 2015 12 31 Thursday 23 0 days 2015-12-31 23:18:00 2015-12-31 23:25:00 0.12 hours No
12/31/2015 23:15:00 12/31/2015 DANGEROUS DRUGS CONTROLLED SUBSTANCE, POSSESSI COMPLETED MISDEMEANOR N.Y. POLICE DEPT MANHATTAN 7 OPPOSITE OF STREET 40.72020 -73.98874 2015 12 31 Thursday 23 0 days 2015-12-31 23:15:00 NA NA No
12/31/2015 23:15:00 12/31/2015 23:15:00 12/31/2015 DANGEROUS WEAPONS WEAPONS POSSESSION 1 & 2 COMPLETED FELONY N.Y. POLICE DEPT BRONX 46 FRONT OF STREET 40.84571 -73.91040 2015 12 31 Thursday 23 0 days 2015-12-31 23:15:00 2015-12-31 23:15:00 0.00 hours No
12/31/2015 23:15:00 12/31/2015 23:30:00 12/31/2015 ASSAULT 3 & RELATED OFFENSES ASSAULT 3 COMPLETED MISDEMEANOR N.Y. POLICE DEPT BRONX 48 INSIDE RESIDENCE - APT. HOUSE 40.85671 -73.89190 2015 12 31 Thursday 23 0 days 2015-12-31 23:15:00 2015-12-31 23:30:00 0.25 hours No
12/31/2015 23:10:00 12/31/2015 23:10:00 12/31/2015 PETIT LARCENY LARCENY,PETIT FROM BUILDING,UN COMPLETED MISDEMEANOR N.Y. POLICE DEPT MANHATTAN 19 INSIDE DRUG STORE 40.76562 -73.96362 2015 12 31 Thursday 23 0 days 2015-12-31 23:10:00 2015-12-31 23:10:00 0.00 hours No

The following table contains the crime totals and demographic information for the five New York City boroughs and for the city as a whole.

borough total_crime land_area population population_percent white white_percent black black_percent asian asian_percent other_or_mixed other_or_mixed_percent unemployment_rate poverty_rate median_age median_household_income median_property_value
BRONX 208966 110.0 1385108 0.1694 386497 0.2790374 505200 0.3647369 50897 0.0367459 442514 0.3194798 0.961 0.304 33.6 35176 368500
BROOKLYN 288804 180.0 2504700 0.3064 1072041 0.4280117 860083 0.3433876 263519 0.1052098 309057 0.1233908 0.825 0.223 34.7 51141 638500
MANHATTAN 223676 59.1 1585873 0.1940 911073 0.5744930 246687 0.1555528 180425 0.1137701 247688 0.1561840 0.659 0.176 36.8 75575 867600
QUEENS 193068 280.0 2230722 0.2729 886053 0.3972046 426683 0.1912757 513317 0.2301125 404669 0.1814072 0.683 0.138 38.1 60422 487400
STATEN ISLAND 44425 152.0 468730 0.0573 341677 0.7289420 49857 0.1063661 35377 0.0754742 41819 0.0892177 0.129 0.144 39.8 71622 457700
Whole New York 958939 781.1 8175133 1.0000 3597341 0.4400346 2088510 0.2554711 1043535 0.1276475 1445747 0.1768469 0.658 0.200 36.0 55752 538300
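
Raw totals favor the most populous boroughs, so a per-resident rate is a fairer basis for comparison. The following is a minimal sketch that re-enters the crime totals, populations and land areas from the table above; the `rates` data frame and its column names are introduced here for illustration, and the rate covers the whole two-year window.

```r
# Values copied from the borough table above.
rates <- data.frame(
  borough     = c("BRONX", "BROOKLYN", "MANHATTAN", "QUEENS", "STATEN ISLAND"),
  total_crime = c(208966, 288804, 223676, 193068, 44425),
  population  = c(1385108, 2504700, 1585873, 2230722, 468730),
  land_area   = c(110, 180, 59.1, 280, 152),
  stringsAsFactors = FALSE
)

# Reported crimes (2014-2015) per 1,000 residents.
rates$crime_per_1000 <- round(1000 * rates$total_crime / rates$population, 1)

# Reported crimes (2014-2015) per square kilometer.
rates$crime_per_km2 <- round(rates$total_crime / rates$land_area)

# Sort from highest to lowest per-resident rate.
print(rates[order(-rates$crime_per_1000), ])
```

With these inputs, the Bronx has the highest number of reported crimes per resident and Queens the lowest, even though Brooklyn has the largest raw total.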

Exploratory Data Analysis

We create tables, charts and maps to explore the New York Crimes dataset.

Date and Time

Year and Month

About 20,000 fewer crimes were reported in 2015 than in 2014.

nyc %>% 
  ggplot(aes(as.character(x = occurance_year))) + 
  geom_bar() +
  ggtitle("New York Crime in 2014 & 2015") + xlab("year") + ylab("number of crime") +
  geom_text(stat = 'count', aes(label = ..count..), vjust = -0.6)

By looking at the number of crimes that occurred each month from 2014 to 2015, we find that August has the highest number of crimes and February the lowest. There are more crimes in summer (May to October) than in winter (November to April). Every month of 2015 except December has slightly fewer crimes than the same month of 2014. Overall, the difference between 2014 and 2015 is small.

plot1 <- nyc %>% 
  ggplot(aes(as.factor(x = occurance_month))) + 
  geom_bar() +
  ggtitle("Monthly New York Crime") + xlab("Month") + ylab("number of crime")

plot2 <- nyc %>% 
  ggplot(aes(as.factor(x = occurance_month),fill = as.factor(occurance_year))) +
  geom_bar(position = "fill") +
  xlab("Month") + ylab("proportion") +
  scale_fill_discrete(guide = guide_legend(title = "year"))

grid.arrange(plot1, plot2, ncol = 2)

Day

We plot the number of crimes for each day of the week. The following graph shows that Friday and Wednesday have more crimes than other days of the week, and Sunday and Monday have fewer. Even criminals don’t want to work on Sunday and have difficulty starting to “work” on Monday.

nyc$occurance_weekdays <- factor(nyc$occurance_weekdays, levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))

plot1 <- nyc %>% 
  ggplot(aes(as.factor(x = occurance_weekdays))) +
  geom_bar() +
  ggtitle("New York Crime - weekdays") + xlab("Weekdays") + ylab("number of crime") +
  geom_text(stat = 'count', aes(label = ..count..), vjust = -0.6)

plot2 <- nyc %>% 
  ggplot(aes(as.factor(x = occurance_weekdays),fill = as.factor(occurance_year))) +
  geom_bar(position = "fill") +
  xlab("weekdays") + ylab("proportion") +
  scale_fill_discrete(guide = guide_legend(title = "year"))

grid.arrange(plot1, plot2, ncol = 2)

We all know the night hours are dangerous. The following graph tells us that night hours on weekends are more dangerous than on weekdays. The good news is that daytime on Saturday and Sunday is safer than on Monday to Friday.

nyc %>%
  ggplot(aes(x = occurance_hour, fill = weekends)) +
  geom_density(alpha = 0.8)
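
A density plot compares the shape of the hourly distribution within each group, not absolute counts, and there are five weekdays for every two weekend days. One way to compare like with like is the average number of crimes per calendar day for each hour. The sketch below does this in base R on invented records; `toy`, `counts`, `days` and `per_day` are names introduced here, and the numbers are not from the dataset.

```r
# Invented records: a Friday, a Saturday and a Monday in January 2015.
toy <- data.frame(
  occurance_date = c("01/02/2015", "01/02/2015", "01/03/2015",
                     "01/03/2015", "01/03/2015", "01/05/2015"),
  occurance_hour = c(23, 14, 23, 23, 2, 14),
  stringsAsFactors = FALSE
)

d <- as.Date(toy$occurance_date, format = "%m/%d/%Y")
# wday is 0 for Sunday and 6 for Saturday, independent of locale.
toy$weekends <- ifelse(as.POSIXlt(d)$wday %in% c(0, 6), "Yes", "No")

# Number of crimes for each (weekends, hour) combination.
counts <- aggregate(n ~ weekends + occurance_hour,
                    data = transform(toy, n = 1), FUN = sum)

# Number of distinct calendar days observed in each group.
days <- tapply(toy$occurance_date, toy$weekends,
               function(x) length(unique(x)))

# Average crimes per calendar day, by hour and group.
counts$per_day <- counts$n / as.numeric(days[as.character(counts$weekends)])
print(counts)
```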

Hour

The following plot shows that the number of crimes is much higher in the afternoon and evening than in the morning. The fewest crimes occur around sunrise.

nyc %>% 
  ggplot(aes(as.factor(x = occurance_hour))) + 
  geom_bar() +
  ggtitle("New York Crime - hour") + xlab("hour") + ylab("number of crime") +
  geom_text(stat = 'count', aes(label = ..count..), vjust = -0.6)

Crime Occurrence

Time between Crime Occurrence and Reporting

The following plot shows the number of days between the date a crime occurred and the date it was reported to police. Almost all crimes in New York are reported within 10 days of occurrence. We also find spikes in reporting around 183 days and 360 days after occurrence.

plot1 <- nyc %>% 
  ggplot(aes(diff_reported.occurance)) +
  geom_histogram(binwidth = 1) +
  ggtitle("Days between reported and occurrence time") + xlab("<=10 days") +
  scale_x_continuous(limits = c(-1,10))

diff1 <- nyc %>% 
  filter(diff_reported.occurance > 10) %>% 
  select(level_of_offense, offense_classification_description, borough, diff_reported.occurance)

plot2 <- diff1 %>% 
  ggplot(aes(diff_reported.occurance)) +
  geom_histogram(binwidth = 1) + 
  ggtitle("") + xlab(">10 days")

grid.arrange(plot1, plot2, ncol = 2)

We decide to use 183 days as a demarcation point. The following plots compare crimes reported 10 to 183 days after occurrence with crimes reported 183 or more days after occurrence, starting with their levels of offense. We find that, in general, the longer the time between the occurrence date and the reporting date, the more serious the crime. Looking at specific offense levels, most violations are reported within 10 days and most misdemeanors within half a year, while felonies take longer to be reported.

# <10days
diff1.short <- nyc %>% 
  filter(diff_reported.occurance <= 10) %>% 
  select(level_of_offense,offense_classification_description,borough,diff_reported.occurance)
diff1.short.borough <- diff1.short %>%
  group_by(borough) %>% 
  summarise(n_borough = n())
borough.crime <- nyc %>% 
  group_by(borough) %>% 
  summarise(n_borough = n())
borough.crime.short <- cbind(diff1.short.borough,borough.crime)[,-3]
colnames(borough.crime.short) <- c("borough","short.reported","total")
borough.crime.short <- borough.crime.short %>% 
  mutate(short.percent = short.reported/total)
within10.borough <- borough.crime.short %>% 
  ggplot(aes(x = borough, y = short.percent)) + 
  geom_bar(stat = "identity" ) +
  ggtitle("time between occurrence and reporting <=10 (Borough)") + 
  xlab("Borough") + ylab("report percent of whole")

# 10<days<183
diff1.mid <- nyc %>% 
  filter(diff_reported.occurance > 10 & diff_reported.occurance < 183) %>% 
  select(level_of_offense,offense_classification_description,borough,diff_reported.occurance)

half.level <- diff1.mid %>% 
  group_by(level_of_offense) %>% 
  summarise(n_level = n()) %>% 
  ggplot(aes(x = level_of_offense, y = n_level)) + geom_bar(stat = "identity") +
  ggtitle("10< time between occurrence and reporting <183 (Level of crime)") + 
  xlab("Level of offense") + ylab("number of crime") +
  geom_text(aes(label = ..y..), vjust = -0.1)

half.description <- diff1.mid %>% 
  group_by(offense_classification_description) %>% 
  summarise(n_class = n()) %>% 
  arrange(desc(n_class)) %>% 
  head(n = 10) %>% 
  ggplot(aes(x = reorder(offense_classification_description,-n_class),y = n_class)) +
  geom_bar(stat = "identity") +
  ggtitle("10< time between occurrence and reporting <183 (offense classification - top 10)") + 
  xlab("offense classification") + ylab("number of crime") +
  geom_text(aes(label = ..y..), vjust = -0.6) + 
  coord_flip()
diff1.mid.borough <- diff1.mid %>% 
  group_by(borough) %>% 
  summarise(n_borough = n())
borough.crime <- nyc %>% 
  group_by(borough) %>% 
  summarise(n_borough = n())
borough.crime.mid <- cbind(diff1.mid.borough,borough.crime)[,-3]
colnames(borough.crime.mid) <- c("borough","mid.reported","total")
borough.crime.mid <- borough.crime.mid %>% 
  mutate(mid.percent = mid.reported/total)
half.borough <- borough.crime.mid %>% 
  ggplot(aes(x = borough, y = mid.percent)) +
  geom_bar(stat = "identity") +
  ggtitle("10< difference between report time and occurrence time <183 (Borough)") + 
  xlab("Borough") + ylab("report percent of whole")

# days>=183
diff1.long <- nyc %>% 
  filter(diff_reported.occurance >= 183) %>%
  select(level_of_offense,offense_classification_description,borough,diff_reported.occurance)

whole.level <- diff1.long %>% 
  group_by(level_of_offense) %>% 
  summarise(n_level = n()) %>% 
  ggplot(aes(x = level_of_offense, y = n_level)) +
  geom_bar(stat = "identity") +
  ggtitle("time between occurrence and reporting >=183 (Level of crime)") + 
  xlab("Level of offense") + ylab("number of crime") +
  geom_text(aes(label = ..y..), vjust = -0.1)

whole.description <- diff1.long %>% 
  group_by(offense_classification_description) %>% 
  summarise(n_class = n()) %>% 
  arrange(desc(n_class)) %>% 
  head(n = 10) %>% 
  ggplot(aes(x = reorder(offense_classification_description,-n_class), y = n_class)) +
  geom_bar(stat = "identity") +
  ggtitle("time between occurrence and reporting >=183 (offense classification - top 10)") + 
  xlab("offense classification") + ylab("number of crime") +
  geom_text(aes(label = ..y..), vjust = -0.6) +
  coord_flip()

diff1.long.borough <- diff1.long %>% 
  group_by(borough) %>% 
  summarise(n_borough = n())
borough.crime <- nyc %>% 
  group_by(borough) %>% 
  summarise(n_borough = n())
borough.crime.long <- cbind(diff1.long.borough,borough.crime)[,-3]
colnames(borough.crime.long) <- c("borough","long.reported","total")
borough.crime.long <- borough.crime.long %>% 
  mutate(long.percent = long.reported/total)
whole.borough <- borough.crime.long %>% 
  ggplot(aes(x = borough,y = long.percent)) +
  geom_bar(stat = "identity") +
  ggtitle("time between occurrence and reporting >=183 (Borough)") + 
  xlab("Borough") + ylab("report percent of whole ")

# comparison
grid.arrange(half.level,whole.level)

We then analyze the difference in crime classification. Crimes reported 10 to 183 days after occurrence have a similar mix of classifications to crimes reported 183 or more days after occurrence.

grid.arrange(half.description,whole.description)

We also find that in all five boroughs of New York, more than 90% of crimes are reported within 10 days of occurrence.

grid.arrange(within10.borough,half.borough,whole.borough)
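
The three delay buckets used in this section can also be derived in a single pass with `cut()`, followed by `prop.table()` for the per-borough shares. The following is a minimal sketch on invented data. It assumes delays are whole days, as they are when computed from two dates, so that the interval (10, 182] is the same as 10 < days < 183. The names `toy`, `delay_bucket` and `share` are introduced here for illustration.

```r
# Invented delays (in days) chosen to sit on the bucket boundaries.
toy <- data.frame(
  borough = c("BRONX", "BRONX", "BRONX", "QUEENS", "QUEENS", "QUEENS"),
  diff_reported.occurance = c(0, 10, 183, 11, 182, 400),
  stringsAsFactors = FALSE
)

# Intervals are right-closed: (-Inf, 10], (10, 182], (182, Inf).
toy$delay_bucket <- cut(
  toy$diff_reported.occurance,
  breaks = c(-Inf, 10, 182, Inf),
  labels = c("<=10 days", "11-182 days", ">=183 days")
)

# Row-wise proportions: share of each borough's crimes in each bucket.
share <- prop.table(table(toy$borough, toy$delay_bucket), margin = 1)
print(round(share, 3))
```

Working from one cross-tabulation also avoids combining separate summaries with `cbind()`, which relies on every summary listing the same boroughs in the same order.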

Time between crime occurrence and ending

Most crimes end within one day of their start.

nyc %>% ggplot(aes(diff_ending.occurance)) +
  geom_histogram() +
  scale_x_continuous(limits = c(-10,100))

We take a closer look at the offense levels of crimes with more than 24 hours between occurrence and ending. The plot below shows more misdemeanors than felonies, and many more felonies than violations. Overall, more serious crimes take longer to commit.

diff2 <- nyc %>% 
  filter(diff_ending.occurance > 24) %>% 
  select(level_of_offense,offense_classification_description,borough,diff_ending.occurance)

diff2 %>% 
  group_by(level_of_offense) %>% 
  summarise(n_level = n()) %>% 
  ggplot(aes(x = level_of_offense,y = n_level)) + geom_bar(stat = "identity") +
  ggtitle("difference between ending and occurrence time >1 day (Level of crime)") + 
  xlab("Level of offense") + ylab("number of crime") +
  geom_text(aes(label = ..y..), vjust = -0.6)

The following bar chart shows the classification of crimes that have more than 24 hours between occurrence and ending.

diff2 %>% 
  group_by(offense_classification_description) %>% summarise(n_class = n()) %>% 
  arrange(desc(n_class)) %>% 
  head(n = 10) %>% 
  ggplot(aes(x = reorder(offense_classification_description, -n_class), y = n_class)) + 
  geom_bar(stat = "identity") +
  ggtitle("difference between ending and occurrence time >1 day (offense classification - top 10)") + 
  xlab("offense classification") + ylab("number of crime") +
  geom_text(aes(label = ..y..), vjust = -0.6) +
  coord_flip()

We also find that among the five boroughs of New York, Brooklyn has the largest number of crimes that last for more than 24 hours.

diff2.borough <- diff2 %>% 
  group_by(borough) %>% 
  summarise(n_borough = n())
diff2.borough %>% 
  ggplot(aes(x = borough,y = n_borough)) +
  geom_bar(stat = "identity") +
  ggtitle("difference between ending and occurrence time>1 day (Borough)") + 
  xlab("Borough") + ylab("number of crime") +
  geom_text(aes(label = ..y..), vjust = -0.6)

Borough

The plot below shows the proportion of races for the five boroughs of New York City.

bca %>%
  gather(key,value, white_percent, black_percent, asian_percent, other_or_mixed_percent) %>%
  ggplot(aes(x = borough, y = value, fill = key)) +
  geom_bar(position = "fill", stat = "identity") +
  ggtitle("Proportion of races (Borough)") + 
  xlab("Borough") + ylab("proportion")

We plot the median age of the five boroughs in the following bar chart. Staten Island has the highest median age among the five.

bca %>%
  ggplot(aes(x = borough, y = median_age)) +
  geom_bar(stat = "identity")

By plotting the unemployment rate and poverty rate on the same graph, we find that even though Staten Island has the lowest unemployment rate, its poverty rate isn’t the lowest. Queens has the lowest poverty rate.

bca %>%
  gather(key,value, unemployment_rate, poverty_rate) %>%
  ggplot(aes(x = borough, y = value, col = key)) +
  geom_point(size = 5, alpha = 0.8)

When it comes to median household income and median property value, we can see that, in general, people in Manhattan have higher incomes and more valuable property.

ggplot(bca, aes(x = median_household_income, y = median_property_value, col = borough)) +
  geom_point(size = 5) 

We compare crimes between the five boroughs of New York City. The plot below shows the number of crimes at each offense level and their proportions for the five boroughs. Felonies make up a higher proportion of crimes in Queens than in the other boroughs, and violations make up a higher proportion in Staten Island.

plot1 <- nyc %>%
  ggplot(aes(x = borough, fill = level_of_offense)) +
  geom_bar(position = "dodge") + 
  guides(fill = "none")

plot2 <- nyc %>%
  ggplot(aes(x = borough, fill = level_of_offense)) +
  geom_bar(position = "fill")

grid.arrange(plot1, plot2, ncol = 2)

The bar chart below shows Brooklyn has the largest number of crimes among the five New York boroughs, so we focus on Brooklyn in this section.

borough.crime <- nyc %>% 
  group_by(borough) %>% 
  summarise(n_borough = n())
borough.crime %>% 
  ggplot(aes(x = borough, y = n_borough)) +
  geom_bar(stat = "identity") +
  ggtitle("Number of crime in different borough") + 
  xlab("Borough") + ylab("number of crime") +
  geom_text(aes(label = ..y..), vjust = -0.6)

Brooklyn

Looking at the top 10 crime types in Brooklyn, we find that petit larceny is the most common crime.

brooklyn <- nyc %>% 
  filter(borough == "BROOKLYN") %>% select(level_of_offense,offense_classification_description,type_of_jurisdiction,type_of_location)
brooklyn %>% 
  group_by(offense_classification_description) %>% 
  summarise(n_class = n()) %>% 
  arrange(desc(n_class)) %>% 
  head(n = 10) %>% 
  ggplot(aes(x = reorder(offense_classification_description, -n_class), y = n_class)) +
  geom_bar(stat = "identity") +
  ggtitle("top 10 crime in brooklyn") + 
  xlab("offense classification") + ylab("number of crime") +
  geom_text(aes(label = ..y..), vjust = -0.6) + 
  coord_flip()

When it comes to level of offense, the number of misdemeanors in Brooklyn is larger than felony and violation combined.

brooklyn %>% 
  group_by(level_of_offense) %>% 
  summarise(n_level = n()) %>% 
  ggplot(aes(x = level_of_offense,y = n_level)) + 
  geom_bar(stat = "identity") +
  ggtitle("crime in brooklyn group by different level of crime") + 
  xlab("Level of offense") + ylab("number of crime") +
  geom_text(aes(label = ..y..), vjust = -0.6)

If we look at jurisdiction, we find that most crimes fall under the direct jurisdiction of the New York Police Department, the New York housing police and the New York transit police.

brooklyn %>% 
  group_by(type_of_jurisdiction) %>% 
  summarise(n_jurisdiction = n()) %>% 
  ggplot(aes(x = type_of_jurisdiction, y = n_jurisdiction)) +
  geom_bar(stat = "identity") +
  ggtitle("crime in brooklyn group by different type of jurisdiction") + 
  xlab("type of jurisdiction") + ylab("number of crime") +
  geom_text(aes(label = ..y..), vjust = -0.6) +
  coord_flip()

We plot the frequency of crime by type of location. It shows that crimes in Brooklyn often happen on the street, in commercial buildings, and in residential houses.

brooklyn %>% 
  group_by(type_of_location) %>% 
  summarise(n_location = n()) %>%
  arrange(desc(n_location)) %>% 
  head(n = 6) %>% 
  ggplot(aes(x = type_of_location, y = n_location)) + 
  geom_bar(stat = "identity") +
  ggtitle("crime in brooklyn group by different type of location") + 
  xlab("type of location") + ylab("number of crime") +
  geom_text(aes(label = ..y..), vjust = -0.6) + 
  coord_flip()

Type and Level

We plot the top 5 crime types in New York. They are petit larceny, harassment 2, assault 3 & related offenses, criminal mischief & related of, and grand larceny.

nyc %>% 
  group_by(offense_classification_description) %>% 
  summarise(n_class = n()) %>% 
  arrange(desc(n_class)) %>% 
  head(n = 5) %>% 
  ggplot(aes(x = reorder(offense_classification_description, -n_class), y = n_class)) +
  geom_bar(stat = "identity") +
  ggtitle("Top 5 crime type in New York") + 
  xlab("offense classification") + ylab("number of crime") +
  geom_text(aes(label = ..y..), vjust = -0.6) +
  coord_flip()

We then color-code the chart above by offense level. All grand larcenies are felonies.

top5class <- nyc %>% 
  filter(offense_classification_description %in% c("PETIT LARCENY","HARRASSMENT 2","ASSAULT 3 & RELATED OFFENSES","CRIMINAL MISCHIEF & RELATED OF","GRAND LARCENY")) %>% 
  select(offense_classification_description,level_of_offense,borough,type_of_location,occurance_month,occurance_day,occurance_weekdays,occurance_hour)

top5class %>% 
  ggplot(aes(as.factor(x = offense_classification_description), fill = level_of_offense)) +
  geom_bar() +
  ggtitle("level of offense of the 5 most frequent crimes") +
  xlab("type of crime") + ylab("number of crime") +
  coord_flip()

Plotting the number of occurrences of the top 5 crime types in each of the five boroughs, we find that Manhattan has a serious petit larceny problem and Staten Island has very few grand larcenies.

top5class %>% 
  ggplot(aes(as.factor(x = offense_classification_description), fill = borough)) +
  geom_bar(position = "dodge") +
  ggtitle("New York Top 5 Crime Types in boroughs") +
  xlab("type of crime") + ylab("number of crime") +
  coord_flip()
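
Raw counts favor the more populous boroughs, so it can help to look at each crime type's share within its borough as well. The sketch below assumes the `borough` and `offense_classification_description` columns used above; the `toy` rows are invented for illustration, and the shares are computed only among the crime types present in the input (for `top5class`, that is the top 5 types).

```r
library(dplyr)

# Invented stand-in for `top5class`, for illustration only.
toy <- data.frame(
  borough = c("MANHATTAN", "MANHATTAN", "MANHATTAN", "MANHATTAN",
              "STATEN ISLAND", "STATEN ISLAND"),
  offense_classification_description = c(
    "PETIT LARCENY", "PETIT LARCENY", "PETIT LARCENY", "GRAND LARCENY",
    "PETIT LARCENY", "HARRASSMENT 2"
  ),
  stringsAsFactors = FALSE
)

# Count crimes per borough and type, then divide by the borough total.
borough_share <- toy %>%
  count(borough, offense_classification_description, name = "n_crimes") %>%
  group_by(borough) %>%
  mutate(share = n_crimes / sum(n_crimes)) %>%
  ungroup()

print(borough_share)
```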

The following table shows the 4 types of location where crimes occurred most often and the number of crimes recorded at each in 2014-2015.

loc <- nyc %>% 
  group_by(type_of_location) %>% 
  summarise(n_location = n()) %>% 
  arrange(desc(n_location)) %>% 
  head(n = 4)
kable(loc)
type_of_location              n_location
STREET                            295257
RESIDENCE - APT. HOUSE            207848
RESIDENCE-HOUSE                    88019
RESIDENCE - PUBLIC HOUSING         72818

We create a bar chart for the locations in the table above and color-code it by the top 5 crime types. Sadly, we find that residences see the most harassment 2 and assault 3 offenses.

top5class %>% 
  filter(type_of_location %in% c("STREET","RESIDENCE - APT. HOUSE","RESIDENCE-HOUSE","RESIDENCE - PUBLIC HOUSING")) %>% 
  ggplot(aes(x = type_of_location, fill = offense_classification_description)) +
  geom_bar(position = "dodge") +
  ggtitle("Top 5 crime types in the 4 most frequent crime locations") +
  xlab("location") + ylab("number of crime") +
  coord_flip()

According to the 4 plots below, the top 5 crime types are more likely to happen from May to October, on the first day of each month, on Fridays, and in the afternoon. The pattern is similar to the pattern for all crimes.

top5class %>%
  ggplot(aes(x = as.factor(occurance_month), fill = offense_classification_description)) +
  geom_bar() +
  ggtitle("Monthly New York Crime") +
  xlab("Month") + ylab("number of crime") +
  scale_fill_discrete(guide = guide_legend(title = "crime type"))  # legend shows crime type, not year

top5class %>% 
  ggplot(aes(x = as.factor(occurance_day), fill = offense_classification_description)) +
  geom_bar() +
  ggtitle("Daily New York Crime") + 
  xlab("Day") + ylab("number of crime") +
  scale_fill_discrete(guide = guide_legend(title = "crime type"))  # legend shows crime type, not year

top5class %>%
  ggplot(aes(x = as.factor(occurance_weekdays), fill = offense_classification_description)) +
  geom_bar() +
  ggtitle("Weekly New York Crime") +
  xlab("Weekdays") + ylab("number of crime") +
  scale_fill_discrete(guide = guide_legend(title = "crime type"))  # legend shows crime type, not year

top5class %>% 
  ggplot(aes(x = as.factor(occurance_hour), fill = offense_classification_description)) +
  geom_bar() +
  ggtitle("Hourly New York Crime") +
  xlab("hour") + ylab("number of crime") +
  scale_fill_discrete(guide = guide_legend(title = "crime type"))  # legend shows crime type, not year

Interactive Maps

We plot the crimes on the map of New York City. (To reduce loading time, we randomly sample 8000 records for each map.)

Crimes that occurred in 2014 and 2015

The following map shows crimes that occurred in 2014. Each point represents one occurrence of a crime.

nyc2014 <- nyc %>% 
  filter(occurance_year == "2014")

crime.distribution2014 <- sample_n(nyc2014, 8e3)  # 8000 points

leaflet(data = crime.distribution2014) %>% 
  addProviderTiles("Esri.NatGeoWorldMap") %>%
  addCircleMarkers(~ longitude, ~latitude, radius = 0.005, color = "yellow", fillOpacity = 0.1) 

The following map shows crimes that occurred in 2015. Each point represents one occurrence of a crime.

nyc2015 <- nyc %>% 
  filter(occurance_year == "2015")

crime.distribution2015 <- sample_n(nyc2015, 8e3)  # 8000 points

leaflet(data = crime.distribution2015) %>% 
  addProviderTiles("Esri.NatGeoWorldMap") %>%
  addCircleMarkers(~ longitude, ~latitude, radius = 0.005, color = "blue", fillOpacity = 0.1) 
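
Because the maps draw a random sample, they change on every run, and records without coordinates cannot be plotted. A minimal sketch of one way to make the sample repeatable is shown here, assuming the `longitude` and `latitude` columns used above; the seed value, the `toy` rows, and the small `n_points` are illustrative choices, not part of the original analysis.

```r
library(dplyr)

set.seed(2015)  # arbitrary seed so the same points are drawn on every run

# Invented stand-in for `nyc2014` / `nyc2015`, for illustration only.
toy <- data.frame(
  longitude = c(-73.98, -73.95, NA, -74.00, -73.90, -73.85),
  latitude  = c(40.75, 40.68, 40.70, NA, 40.85, 40.72)
)

n_points <- 3  # the report samples 8000 per map

# Drop rows without coordinates first, then sample,
# so the map receives exactly n_points plottable records.
map_sample <- toy %>%
  filter(!is.na(longitude), !is.na(latitude)) %>%
  sample_n(n_points)

print(map_sample)
```

Filtering before sampling keeps the number of plotted points fixed. Sampling first and removing missing values afterwards can leave fewer points than requested.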

Date and time of occurrence

Clicking a numbered cluster on the map below zooms in and shows more detail. When location pins appear, hover over a pin to see the date and time of the crime that occurred there. You can select which year to show in the top right-hand corner of the map.

nyc.s <- sample_n(nyc, 8000)   # randomly pick 8000 points
nyc.s <- na.omit(nyc.s)        # drop sampled rows with missing values
nyc.s.year <- split(nyc.s, nyc.s$occurance_year)  # split by year

l <- leaflet() %>% addTiles()

names(nyc.s.year) %>%
  purrr::walk( function(year) {
    l <<- l %>%
      addMarkers(data = nyc.s.year[[year]],
                 lng = ~longitude, lat = ~latitude,
                 label = ~as.character(occurance_date_time),
                 popup = ~as.character(occurance_date_time),
                 group = year,
                 clusterOptions = markerClusterOptions(removeOutsideVisibleBounds = FALSE),
                 labelOptions = labelOptions(noHide = F, direction = 'auto'))
  })

l %>%
  addLayersControl(
    overlayGroups = names(nyc.s.year),
    options = layersControlOptions(collapsed = FALSE)
  )

Crime type

Clicking a numbered cluster on the map below zooms in and shows more detail. When location pins appear, hover over a pin to see the type of crime that occurred there. You can select which level of offense to show in the top right-hand corner of the map.

nyc.s <- sample_n(nyc, 8000)   # randomly pick 8000 points
nyc.s <- na.omit(nyc.s)        # drop sampled rows with missing values
nyc.s.level <- split(nyc.s,nyc.s$level_of_offense)

l <- leaflet() %>% addTiles()

names(nyc.s.level) %>%
  purrr::walk( function(level) {
    l <<- l %>%
      addMarkers(data = nyc.s.level[[level]],
                 lng = ~longitude, lat = ~latitude,
                 label = ~as.character(offense_classification_description),
                 popup = ~as.character(offense_classification_description),
                 group = level,
                 clusterOptions = markerClusterOptions(removeOutsideVisibleBounds = FALSE),
                 labelOptions = labelOptions(noHide = F, direction = 'auto'))
  })


l %>%
  addLayersControl(
    overlayGroups = names(nyc.s.level),
    options = layersControlOptions(collapsed = FALSE)
  )

Type of Location

Clicking a numbered cluster on the map below zooms in and shows more detail. When location pins appear, hover over a pin to see the type of location where the crime occurred. You can select which borough to show in the top right-hand corner of the map.

nyc.s <- sample_n(nyc, 8000)   # randomly pick 8000 points
nyc.s <- na.omit(nyc.s)        # drop sampled rows with missing values
nyc.s.borough <- split(nyc.s,nyc.s$borough)

l <- leaflet() %>% addTiles()

names(nyc.s.borough) %>%
  purrr::walk( function(borough) {
    l <<- l %>%
      addMarkers(data = nyc.s.borough[[borough]],
                 lng = ~longitude, lat = ~latitude,
                 label = ~as.character(type_of_location),
                 popup = ~as.character(type_of_location),
                 group = borough,
                 clusterOptions = markerClusterOptions(removeOutsideVisibleBounds = FALSE),
                 labelOptions = labelOptions(noHide = F, direction = 'auto'))
  })

l %>%
  addLayersControl(
    overlayGroups = names(nyc.s.borough),
    options = layersControlOptions(collapsed = FALSE)
  )
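
The three clustered maps above follow the same pattern: split the sample by a grouping column, add one marker layer per group, and attach a layer control. That pattern could be wrapped in a single function, as in the sketch below. The helper name `clustered_map`, its arguments, and the `toy` rows are hypothetical and not part of the original analysis; it uses a plain `for` loop instead of `purrr::walk` with `<<-`.

```r
library(leaflet)

# Hypothetical helper: clustered marker map with one toggleable layer
# per value of `group_col`, labelled by the values in `label_col`.
clustered_map <- function(df, group_col, label_col) {
  # Keep only rows that can be placed on the map.
  df <- df[!is.na(df$longitude) & !is.na(df$latitude), ]
  pieces <- split(df, df[[group_col]], drop = TRUE)

  m <- addTiles(leaflet())
  for (g in names(pieces)) {
    labels <- as.character(pieces[[g]][[label_col]])
    m <- addMarkers(
      m, data = pieces[[g]],
      lng = ~longitude, lat = ~latitude,
      label = labels, popup = labels,
      group = g,
      clusterOptions = markerClusterOptions(removeOutsideVisibleBounds = FALSE)
    )
  }
  addLayersControl(
    m,
    overlayGroups = names(pieces),
    options = layersControlOptions(collapsed = FALSE)
  )
}

# Invented stand-in for the sampled `nyc` data, for illustration only.
toy <- data.frame(
  longitude = c(-73.98, -73.95, -74.15),
  latitude  = c(40.75, 40.68, 40.58),
  borough = c("MANHATTAN", "BROOKLYN", "STATEN ISLAND"),
  type_of_location = c("STREET", "RESIDENCE-HOUSE", "STREET"),
  stringsAsFactors = FALSE
)

toy_map <- clustered_map(toy, "borough", "type_of_location")
```

With the real data, a call such as `clustered_map(nyc.s, "level_of_offense", "offense_classification_description")` would correspond to the crime type map.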

Summary

To uncover new information, we reviewed and classified all of the variables and added derived variables to the data set. We created multiple tables and charts to help understand the data. Based on our analysis, we report the following findings, which may be useful to most people:

Time and Trend

  • The number of crimes in New York City decreased from 2014 to 2015.
  • There are more crimes in summer than in winter.
  • More crimes occurred toward the end of the workweek and on the first day of the month. Fewer crimes occurred late in the weekend and early in the workweek.
  • Night hours on weekends are more dangerous than night hours on weekdays.
  • The number of crimes is generally lower in the morning.

Crime Type and Level

  • The top 3 crime types in New York City are larceny, harassment and assault.
  • All grand larcenies are felonies.
  • A large number of harassments and assaults happen in residences.

Crime Occurrence

  • Felonies take longer to be reported to the police than misdemeanors and violations.
  • More than 90% of crimes are reported within 10 days of occurrence.
  • Most crimes end within 24 hours of when they begin.
  • Overall, more serious crimes take longer to commit.

Borough

  • Brooklyn has the largest number of crimes among the five New York boroughs.
  • Staten Island has the lowest poverty rate and the fewest crimes.
  • Manhattan has a serious petit larceny issue.

Implications

This analysis can be used by individuals and organizations to gain an understanding of the crime situation in New York City.

Limitations

This analysis was limited by the time span of the data set and the lack of data mining. With only two years of data, our yearly and monthly analysis might be biased. We cannot determine whether there is a long-term downward trend in crime or how crime in New York City changes over time. We collected demographic information for the five New York City boroughs, but the sample size is too small to conduct data mining as we planned. Combining crime data from past decades with demographic information may produce a data set large enough for data mining on which demographic factors affect crime rates.