The Trump administration has made reducing violent crime a priority. New York, the nation’s largest city, is often portrayed as a violent city in the press, on TV and in the movies. In this project, we analyze the crimes reported in all five boroughs of New York City from 2014 to 2015 to find out where and when crimes happen and which factors influence a borough’s crime rate.
This project explores the New York City Crimes dataset, which contains NYPD complaint data from 2014 to 2015. The dataset records each crime’s time of occurrence, location, status, type and level of offense. We also collect demographic information (population, unemployment rate, poverty rate, median age and median household income) for New York City as a whole and for its five boroughs (Bronx, Brooklyn, Manhattan, Queens, Staten Island).
We prepare and clean the dataset, then conduct exploratory data analysis and visualize the data. The visualizations show when most crimes happen and what kinds of crimes they are. We also examine how crime relates to the demographic factors of the five boroughs, and we use interactive maps to show how crime varies between blocks.
This project is intended to give the public and organizations a better understanding of crime in New York City. We hope this analysis helps residents decide when and where it’s safe to go out alone. Organizations can use the data to see which factors can be improved to make the community safer.
The following packages are used in order to produce results throughout this project.
library(tidyr) # used for tidying up data
library(dplyr) # used for data manipulation
library(lubridate) # used for transforming date
library(knitr) # used for viewing data
library(leaflet) # used for creating interactive maps
library(ggplot2) # used for data visualization
library(gridExtra) # used for arranging grid-based plots
We perform the following procedures to get the data ready for analysis.
The New York City Crimes data we use is available on Kaggle as csv files. The data is collected from NYC Open Data and contains crimes reported to the NYPD from 2014 to 2015. The following table lists the original column names, the easier-to-understand names we rename them to, and a description of each column.
| OldColumnName | NewColumnName | ColumnDescription |
|---|---|---|
| CMPLNT_NUM | crime_id | Randomly generated persistent ID for each complaint |
| CMPLNT_FR_DT | occurance_date | Exact date of occurrence for the reported event (or starting date of occurrence, if CMPLNT_TO_DT exists) |
| CMPLNT_FR_TM | occurance_time | Exact time of occurrence for the reported event (or starting time of occurrence, if CMPLNT_TO_TM exists) |
| CMPLNT_TO_DT | ending_date | Ending date of occurrence for the reported event, if exact time of occurrence is unknown |
| CMPLNT_TO_TM | ending_time | Ending time of occurrence for the reported event, if exact time of occurrence is unknown |
| RPT_DT | reported_date | Date event was reported to police |
| KY_CD | offense_classification_code | Three digit offense classification code |
| OFNS_DESC | offense_classification_description | Description of offense corresponding with key code |
| PD_CD | internal_classification_code | Three digit internal classification code (more granular than Key Code) |
| PD_DESC | internal_classification_description | Description of internal classification corresponding with PD code (more granular than Offense Description) |
| CRM_ATPT_CPTD_CD | crime_status | Indicator of whether crime was successfully completed or attempted, but failed or was interrupted prematurely |
| LAW_CAT_CD | level_of_offense | Level of offense: felony, misdemeanor, violation |
| JURIS_DESC | type_of_jurisdiction | Jurisdiction responsible for incident. Either internal, like Police, Transit, and Housing; or external, like Correction, Port Authority, etc. |
| BORO_NM | borough | The name of the borough in which the incident occurred |
| ADDR_PCT_CD | precienct | The precinct in which the incident occurred |
| LOC_OF_OCCUR_DESC | specific_location | Specific location of occurrence in or around the premises; inside, opposite of, front of, rear of |
| PREM_TYP_DESC | type_of_location | Specific description of premises; grocery store, residence, street, etc. |
| PARKS_NM | park_name | Name of NYC park, playground or greenspace of occurrence, if applicable (state parks are not included) |
| HADEVELOPT | housing_name | Name of NYCHA housing development of occurrence, if applicable |
| X_COORD_CD | x_coordinate | X-coordinate for New York State Plane Coordinate System, Long Island Zone, NAD 83, units feet (FIPS 3104) |
| Y_COORD_CD | y_coordinate | Y-coordinate for New York State Plane Coordinate System, Long Island Zone, NAD 83, units feet (FIPS 3104) |
| Latitude | latitude | Latitude coordinate for Global Coordinate System, WGS 1984, decimal degrees (EPSG 4326) |
| Longitude | longitude | Longitude coordinate for Global Coordinate System, WGS 1984, decimal degrees (EPSG 4326) |
| Lat_Lon | location | Latitude and longitude combined into a single location value |
We also collect demographic information for the five New York City boroughs from the following website and put them in a new table:
We run the following code to import the dataset downloaded from Kaggle into R. All columns are given easier-to-understand names. After inspecting the source data, we decide to remove the “crime_id”, “offense_classification_code”, “internal_classification_code”, “x_coordinate”, “y_coordinate” and “location” columns because they are redundant or unwanted.
nyc <- read.csv("NYPD_Complaint_Data_Historic.csv")
colnames(nyc) <- c("crime_id","occurance_date","occurance_time","ending_date","ending_time","reported_date","offense_classification_code","offense_classification_description","internal_classification_code","internal_classification_description","crime_status","level_of_offense","type_of_jurisdiction","borough","precienct","specific_location","type_of_location","park_name","housing_name","x_coordinate","y_coordinate","latitude","longitude","location")
nyc <- nyc[,-c(1,7,9,20,21,24)]
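Dropping columns by position works only as long as the column order never changes. As a minimal sketch of an alternative, assuming a small hand-made data frame (`toy` and its values are hypothetical, not taken from the dataset), the same kind of removal can be written by name:

```r
# Toy data frame standing in for the renamed complaint data (hypothetical values).
toy <- data.frame(
  crime_id = 1:2,
  occurance_date = c("12/31/2015", "01/01/2014"),
  offense_classification_code = c(113, 344),
  borough = c("BRONX", "QUEENS"),
  x_coordinate = c(1007522, 1046367),
  stringsAsFactors = FALSE
)

# Columns to discard, listed by name instead of by position.
drop_cols <- c("crime_id", "offense_classification_code", "x_coordinate")

# Keep every column whose name is not in drop_cols.
toy_clean <- toy[, setdiff(names(toy), drop_cols), drop = FALSE]
names(toy_clean)
```

Listing names makes it easier to see which columns are removed and keeps the step valid if a column is added or reordered upstream.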
We then extract the components of “occurance_date” as “occurance_year”, “occurance_month”, “occurance_day” and “occurance_weekdays”, and derive “occurance_hour” from “occurance_time”. The duration of each crime is calculated as “diff_ending.occurance”, and “diff_reported.occurance” is the time between the occurrence of the crime and its being reported to the NYPD. We also add a “weekends” flag for crimes that occurred on a Saturday or Sunday. The dataset contains a few records from before 2014, which we remove.
nyc <- nyc %>%
mutate(occurance_year = year(mdy(occurance_date)),
occurance_month = month(mdy(occurance_date)),
occurance_day = day(mdy(occurance_date)),
occurance_weekdays = weekdays(mdy(occurance_date)),
occurance_hour = hour(hms(occurance_time)),
diff_reported.occurance = difftime(mdy(reported_date),mdy(occurance_date),units = "day"),
occurance_date_time = as.POSIXct(paste(occurance_date,occurance_time),format = "%m/%d/%Y %H:%M:%S"),
ending_date_time = as.POSIXct(paste(ending_date,ending_time),format = "%m/%d/%Y %H:%M:%S"),
diff_ending.occurance = round(difftime(ending_date_time,occurance_date_time,units = "hours"),digits = 2),
weekends = ifelse(occurance_weekdays %in% c("Sunday","Saturday"),"Yes","No")
) %>%
filter(occurance_year %in% c(2014, 2015))  # year() returns a number, so compare against numbers
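The derived columns are easy to sanity-check on a couple of hand-made records. The sketch below uses base R only and hypothetical values; `toy`, `days_to_report` and `kept` are names introduced here for illustration and are not part of the project code.

```r
# Two hypothetical complaint records in the same text format as the source columns.
toy <- data.frame(
  occurance_date = c("12/31/2015", "01/02/2016"),
  occurance_time = c("23:45:00", "08:05:00"),
  reported_date  = c("12/31/2015", "01/05/2016"),
  stringsAsFactors = FALSE
)

occ_date <- as.Date(toy$occurance_date, format = "%m/%d/%Y")
rep_date <- as.Date(toy$reported_date, format = "%m/%d/%Y")

# Hour of day, taken from the first two characters of the time string.
toy$occurance_hour <- as.integer(substr(toy$occurance_time, 1, 2))

# Whole days between occurrence and report.
toy$days_to_report <- as.numeric(difftime(rep_date, occ_date, units = "days"))

# Year as a number, then keep only the 2014-2015 study window.
toy$occurance_year <- as.integer(format(occ_date, "%Y"))
kept <- toy[toy$occurance_year %in% c(2014, 2015), ]
kept
```

Only the first record survives the year filter, which mirrors how records outside 2014 and 2015 are dropped from the full dataset.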
We calculate the total number of crimes reported for each borough of New York City as “total_crime” in a new table.
table2 <- nyc %>%
group_by(borough) %>%
summarise(total_crime = n())  # n() counts the rows in each borough group
table2$total_crime <- as.numeric(table2$total_crime)
We then add the following information for each of the 5 boroughs:
total_poplutation <- rep(8175133,5)
population <- c(1385108,2504700,1585873,2230722,468730)
population_percent <- round(population/total_poplutation,digits = 4)
land_area <- c(110,180,59.1,280,152)
white <- c(386497,1072041,911073,886053,341677)
black <- c(505200,860083,246687,426683,49857)
asian <- c(50897,263519,180425,513317,35377)
other_or_mixed <- c(442514,309057,247688,404669,41819)
white_percent <- white/population
black_percent <- black/population
asian_percent <- asian/population
other_or_mixed_percent <- other_or_mixed/population
unemployment_rate <- c(0.961,0.825,0.659,0.683,0.129)
poverty_rate <- c(0.304,0.223,0.176,0.138,0.144)
median_age <- c(33.6,34.7,36.8,38.1,39.8)
median_household_income <- c(35176,51141,75575,60422,71622)
median_property_value <- c(368500,638500,867600,487400,457700)
table3 <- cbind(table2, land_area, population, population_percent, white, white_percent, black, black_percent, asian, asian_percent, other_or_mixed, other_or_mixed_percent, unemployment_rate, poverty_rate, median_age, median_household_income, median_property_value)
newrow1 <- data.frame(borough = "Whole New York",
total_crime = sum(table3$total_crime),
land_area = sum(table3$land_area),
population = sum(table3$population),
population_percent = sum(population_percent),
white = sum(table3$white),
white_percent = sum(table3$white)/total_poplutation[1],
black = sum(table3$black),
black_percent = sum(table3$black)/total_poplutation[1],
asian = sum(table3$asian),
asian_percent = sum(table3$asian)/total_poplutation[1],
other_or_mixed = sum(table3$other_or_mixed),
other_or_mixed_percent = sum(table3$other_or_mixed)/total_poplutation[1],
unemployment_rate = 0.658,
poverty_rate = 0.20,
median_age = 36,
median_household_income = 55752,median_property_value = 538300)
#borough crime analysis table
bca <- rbind(table3,newrow1)
The following table is a snapshot of the cleaned dataset that contains all crime records.
| occurance_date | occurance_time | ending_date | ending_time | reported_date | offense_classification_description | internal_classification_description | crime_status | level_of_offense | type_of_jurisdiction | borough | precienct | specific_location | type_of_location | park_name | housing_name | latitude | longitude | occurance_year | occurance_month | occurance_day | occurance_weekdays | occurance_hour | diff_reported.occurance | occurance_date_time | ending_date_time | diff_ending.occurance | weekends |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 12/31/2015 | 23:45:00 | | | 12/31/2015 | FORGERY | FORGERY,ETC.,UNCLASSIFIED-FELO | COMPLETED | FELONY | N.Y. POLICE DEPT | BRONX | 44 | INSIDE | BAR/NIGHT CLUB | | | 40.82885 | -73.91666 | 2015 | 12 | 31 | Thursday | 23 | 0 days | 2015-12-31 23:45:00 | NA | NA | No |
| 12/31/2015 | 23:36:00 | | | 12/31/2015 | MURDER & NON-NEGL. MANSLAUGHTER | | COMPLETED | FELONY | N.Y. POLICE DEPT | QUEENS | 103 | OUTSIDE | | | | 40.69734 | -73.78456 | 2015 | 12 | 31 | Thursday | 23 | 0 days | 2015-12-31 23:36:00 | NA | NA | No |
| 12/31/2015 | 23:30:00 | | | 12/31/2015 | DANGEROUS DRUGS | CONTROLLED SUBSTANCE,INTENT TO | COMPLETED | FELONY | N.Y. POLICE DEPT | MANHATTAN | 28 | | OTHER | | | 40.80261 | -73.94505 | 2015 | 12 | 31 | Thursday | 23 | 0 days | 2015-12-31 23:30:00 | NA | NA | No |
| 12/31/2015 | 23:30:00 | | | 12/31/2015 | ASSAULT 3 & RELATED OFFENSES | ASSAULT 3 | COMPLETED | MISDEMEANOR | N.Y. POLICE DEPT | QUEENS | 105 | INSIDE | RESIDENCE-HOUSE | | | 40.65455 | -73.72634 | 2015 | 12 | 31 | Thursday | 23 | 0 days | 2015-12-31 23:30:00 | NA | NA | No |
| 12/31/2015 | 23:25:00 | 12/31/2015 | 23:30:00 | 12/31/2015 | ASSAULT 3 & RELATED OFFENSES | ASSAULT 3 | COMPLETED | MISDEMEANOR | N.Y. POLICE DEPT | MANHATTAN | 13 | FRONT OF | OTHER | | | 40.73800 | -73.98789 | 2015 | 12 | 31 | Thursday | 23 | 0 days | 2015-12-31 23:25:00 | 2015-12-31 23:30:00 | 0.08 hours | No |
| 12/31/2015 | 23:18:00 | 12/31/2015 | 23:25:00 | 12/31/2015 | FELONY ASSAULT | ASSAULT 2,1,UNCLASSIFIED | ATTEMPTED | FELONY | N.Y. POLICE DEPT | BROOKLYN | 71 | FRONT OF | DRUG STORE | | | 40.66502 | -73.95711 | 2015 | 12 | 31 | Thursday | 23 | 0 days | 2015-12-31 23:18:00 | 2015-12-31 23:25:00 | 0.12 hours | No |
| 12/31/2015 | 23:15:00 | | | 12/31/2015 | DANGEROUS DRUGS | CONTROLLED SUBSTANCE, POSSESSI | COMPLETED | MISDEMEANOR | N.Y. POLICE DEPT | MANHATTAN | 7 | OPPOSITE OF | STREET | | | 40.72020 | -73.98874 | 2015 | 12 | 31 | Thursday | 23 | 0 days | 2015-12-31 23:15:00 | NA | NA | No |
| 12/31/2015 | 23:15:00 | 12/31/2015 | 23:15:00 | 12/31/2015 | DANGEROUS WEAPONS | WEAPONS POSSESSION 1 & 2 | COMPLETED | FELONY | N.Y. POLICE DEPT | BRONX | 46 | FRONT OF | STREET | | | 40.84571 | -73.91040 | 2015 | 12 | 31 | Thursday | 23 | 0 days | 2015-12-31 23:15:00 | 2015-12-31 23:15:00 | 0.00 hours | No |
| 12/31/2015 | 23:15:00 | 12/31/2015 | 23:30:00 | 12/31/2015 | ASSAULT 3 & RELATED OFFENSES | ASSAULT 3 | COMPLETED | MISDEMEANOR | N.Y. POLICE DEPT | BRONX | 48 | INSIDE | RESIDENCE - APT. HOUSE | | | 40.85671 | -73.89190 | 2015 | 12 | 31 | Thursday | 23 | 0 days | 2015-12-31 23:15:00 | 2015-12-31 23:30:00 | 0.25 hours | No |
| 12/31/2015 | 23:10:00 | 12/31/2015 | 23:10:00 | 12/31/2015 | PETIT LARCENY | LARCENY,PETIT FROM BUILDING,UN | COMPLETED | MISDEMEANOR | N.Y. POLICE DEPT | MANHATTAN | 19 | INSIDE | DRUG STORE | | | 40.76562 | -73.96362 | 2015 | 12 | 31 | Thursday | 23 | 0 days | 2015-12-31 23:10:00 | 2015-12-31 23:10:00 | 0.00 hours | No |
The following table is the dataset that contains demographic information of the five New York City boroughs.
| borough | total_crime | land_area | population | population_percent | white | white_percent | black | black_percent | asian | asian_percent | other_or_mixed | other_or_mixed_percent | unemployment_rate | poverty_rate | median_age | median_household_income | median_property_value |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BRONX | 208966 | 110.0 | 1385108 | 0.1694 | 386497 | 0.2790374 | 505200 | 0.3647369 | 50897 | 0.0367459 | 442514 | 0.3194798 | 0.961 | 0.304 | 33.6 | 35176 | 368500 |
| BROOKLYN | 288804 | 180.0 | 2504700 | 0.3064 | 1072041 | 0.4280117 | 860083 | 0.3433876 | 263519 | 0.1052098 | 309057 | 0.1233908 | 0.825 | 0.223 | 34.7 | 51141 | 638500 |
| MANHATTAN | 223676 | 59.1 | 1585873 | 0.1940 | 911073 | 0.5744930 | 246687 | 0.1555528 | 180425 | 0.1137701 | 247688 | 0.1561840 | 0.659 | 0.176 | 36.8 | 75575 | 867600 |
| QUEENS | 193068 | 280.0 | 2230722 | 0.2729 | 886053 | 0.3972046 | 426683 | 0.1912757 | 513317 | 0.2301125 | 404669 | 0.1814072 | 0.683 | 0.138 | 38.1 | 60422 | 487400 |
| STATEN ISLAND | 44425 | 152.0 | 468730 | 0.0573 | 341677 | 0.7289420 | 49857 | 0.1063661 | 35377 | 0.0754742 | 41819 | 0.0892177 | 0.129 | 0.144 | 39.8 | 71622 | 457700 |
| Whole New York | 958939 | 781.1 | 8175133 | 1.0000 | 3597341 | 0.4400346 | 2088510 | 0.2554711 | 1043535 | 0.1276475 | 1445747 | 0.1768469 | 0.658 | 0.200 | 36.0 | 55752 | 538300 |
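Raw totals favor the most populous boroughs, so a per-capita figure is a useful companion to the table above. The following is a minimal sketch, not part of the original analysis: it copies the borough totals and populations from the table and computes reported complaints over the two-year window per 1,000 residents. The column name `crime_per_1000` is introduced here for illustration and is not a column of `bca`.

```r
# Totals and populations copied from the borough table above.
rates <- data.frame(
  borough     = c("BRONX", "BROOKLYN", "MANHATTAN", "QUEENS", "STATEN ISLAND"),
  total_crime = c(208966, 288804, 223676, 193068, 44425),
  population  = c(1385108, 2504700, 1585873, 2230722, 468730),
  stringsAsFactors = FALSE
)

# Complaints per 1,000 residents over 2014-2015 (illustrative column name).
rates$crime_per_1000 <- round(rates$total_crime / rates$population * 1000, 1)

# Boroughs ordered from highest to lowest rate.
rates[order(-rates$crime_per_1000), ]
```

On these figures the Bronx has the highest rate and Queens the lowest, while Brooklyn, which has the largest raw count, falls in the middle.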
We create tables, charts and maps to explore the New York Crimes dataset.
About 20,000 fewer crimes were reported in 2015 than in 2014.
nyc %>%
ggplot(aes(as.character(x = occurance_year))) +
geom_bar() +
ggtitle("New York Crime in 2014 & 2015") + xlab("year") + ylab("number of crime") +
geom_text(stat = 'count', aes(label = ..count..), vjust = -0.6)
Looking at the number of crimes that occurred each month from 2014 to 2015, we find that August has the highest number of crimes and February the lowest. There are more crimes in summer (May to October) than in winter (November to April). Every month of 2015 except December has slightly fewer crimes than the same month of 2014. Overall, the difference between 2014 and 2015 is small.
plot1 <- nyc %>%
ggplot(aes(as.factor(x = occurance_month))) +
geom_bar() +
ggtitle("Monthly New York Crime") + xlab("Month") + ylab("number of crime")
plot2 <- nyc %>%
ggplot(aes(as.factor(x = occurance_month),fill = as.factor(occurance_year))) +
geom_bar(position = "fill") +
xlab("Month") + ylab("proportion") +
scale_fill_discrete(guide = guide_legend(title = "year"))
grid.arrange(plot1, plot2, ncol = 2)
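The month-by-month comparison read from the charts can also be checked numerically with a cross-tabulation. The sketch below uses a handful of hypothetical records (`toy` is not the real dataset) to show the shape of the calculation; with the real data the two columns would come from `nyc`.

```r
# Hypothetical records; in the report these columns come from nyc.
toy <- data.frame(
  occurance_year  = c(2014, 2014, 2014, 2015, 2015, 2014, 2015),
  occurance_month = c(2, 8, 8, 2, 8, 8, 8)
)

# Rows are months, columns are years.
monthly <- table(month = toy$occurance_month, year = toy$occurance_year)
monthly

# Month with the most records across both years.
totals  <- rowSums(monthly)
busiest <- names(totals)[which.max(totals)]

# Change from 2014 to 2015, month by month (negative means fewer in 2015).
change <- monthly[, "2015"] - monthly[, "2014"]
change
```

A table like this makes statements such as "every month except December is lower in 2015" directly verifiable instead of being read off stacked bars.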
We plot the number of crimes on each day of the week. The following graph shows that Friday and Wednesday have more crimes than other days of the week, and Sunday and Monday have fewer. Even criminals don’t want to work on Sunday and have difficulty starting to “work” on Monday.
nyc$occurance_weekdays <- factor(nyc$occurance_weekdays, levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))
plot1 <- nyc %>%
ggplot(aes(as.factor(x = occurance_weekdays))) +
geom_bar() +
ggtitle("New York Crime - weekdays") + xlab("Weekdays") + ylab("number of crime") +
geom_text(stat = 'count', aes(label = ..count..), vjust = -0.6)
plot2 <- nyc %>%
ggplot(aes(as.factor(x = occurance_weekdays),fill = as.factor(occurance_year))) +
geom_bar(position = "fill") +
xlab("weekdays") + ylab("proportion") +
scale_fill_discrete(guide = guide_legend(title = "year"))
grid.arrange(plot1, plot2, ncol = 2)
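One caveat: `weekdays()` returns day names in the session's locale, so matching against English factor levels can produce `NA` values on a non-English system. A minimal, locale-independent sketch is shown below; it relies on the `"%u"` format code (ISO weekday number), and the names `iso_day`, `weekday` and `weekend` are introduced here for illustration only.

```r
# Three example dates: a Thursday, a Saturday and a Monday.
dates <- as.Date(c("2015-12-31", "2016-01-02", "2016-01-04"))

# "%u" gives the ISO weekday number: 1 = Monday ... 7 = Sunday.
iso_day <- as.integer(format(dates, "%u"))

# Map the numbers to English labels, ordered Monday to Sunday.
day_names <- c("Monday", "Tuesday", "Wednesday", "Thursday",
               "Friday", "Saturday", "Sunday")
weekday <- factor(day_names[iso_day], levels = day_names)

# Saturday (6) and Sunday (7) count as the weekend.
weekend <- ifelse(iso_day >= 6, "Yes", "No")

data.frame(dates, weekday, weekend)
```

Because the labels are assigned from the weekday number, the factor order and the weekend flag do not depend on the language settings of the machine running the analysis.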
We all know the night hours are dangerous. The following density plot suggests that night hours account for a larger share of crimes on weekends than on weekdays. The good news is that daytime hours account for a smaller share of crimes on Saturday and Sunday than from Monday to Friday.
nyc %>%
ggplot(aes(x = occurance_hour, fill = weekends)) +
geom_density(alpha = 0.8)
The following plot shows that the number of crimes is much higher in the afternoon and evening than in the morning. The fewest crimes occur around sunrise.
nyc %>%
ggplot(aes(as.factor(x = occurance_hour))) +
geom_bar() +
ggtitle("New York Crime - hour") + xlab("hour") + ylab("number of crime") +
geom_text(stat = 'count', aes(label = ..count..), vjust = -0.6)
The following plot shows the number of days between the date a crime occurred and the date it was reported to police. Almost all crimes in New York are reported within 10 days of occurrence. We also find clusters of crimes that are reported around 183 days or 360 days after they occurred.
plot1 <- nyc %>%
ggplot(aes(diff_reported.occurance)) +
geom_histogram(binwidth = 1) +
ggtitle("Days between reported and occurrence time") + xlab("<=10 days") +
scale_x_continuous(limits = c(-1,10))
diff1 <- nyc %>%
filter(diff_reported.occurance > 10) %>%
select(level_of_offense, offense_classification_description, borough, diff_reported.occurance)
plot2 <- diff1 %>%
ggplot(aes(diff_reported.occurance)) +
geom_histogram(binwidth = 1) +
ggtitle("") + xlab(">10 days")
grid.arrange(plot1, plot2, ncol = 2)
We decide to use 183 days as a demarcation point. The following plots compare crimes reported 10 to 183 days after occurrence with crimes reported 183 days or more after occurrence, starting with their levels of offense. We find that in general, the longer the gap between the occurrence date and the reporting date, the more severe the offense. In terms of specific offense levels, most violations are reported within 10 days and most misdemeanors within half a year, while felonies take longer to be reported.
# <10days
diff1.short <- nyc %>%
filter(diff_reported.occurance <= 10) %>%
select(level_of_offense,offense_classification_description,borough,diff_reported.occurance)
diff1.short.borough <- diff1.short %>%
group_by(borough) %>%
summarise(n_borough = n())
borough.crime <- nyc %>%
group_by(borough) %>%
summarise(n_borough = n())
borough.crime.short <- cbind(diff1.short.borough,borough.crime)[,-3]
colnames(borough.crime.short) <- c("borough","short.reported","total")
borough.crime.short <- borough.crime.short %>%
mutate(short.percent = short.reported/total)
within10.borough <- borough.crime.short %>%
ggplot(aes(x = borough, y = short.percent)) +
geom_bar(stat = "identity" ) +
ggtitle("time between occurrence and reporting <=10 (Borough)") +
xlab("Borough") + ylab("report percent of whole")
# 10<days<183
diff1.mid <- nyc %>%
filter(diff_reported.occurance > 10 & diff_reported.occurance < 183) %>%
select(level_of_offense,offense_classification_description,borough,diff_reported.occurance)
half.level <- diff1.mid %>%
group_by(level_of_offense) %>%
summarise(n_level = n()) %>%
ggplot(aes(x = level_of_offense, y = n_level)) + geom_bar(stat = "identity") +
ggtitle("10< time between occurrence and reporting <183 (Level of crime)") +
xlab("Level of offense") + ylab("number of crime") +
geom_text(aes(label = ..y..), vjust = -0.1)
half.description <- diff1.mid %>%
group_by(offense_classification_description) %>%
summarise(n_class = n()) %>%
arrange(desc(n_class)) %>%
head(n = 10) %>%
ggplot(aes(x = reorder(offense_classification_description,-n_class),y = n_class)) +
geom_bar(stat = "identity") +
ggtitle("10< time between occurrence and reporting <183 (offense classification - top 10)") +
xlab("offense classification") + ylab("number of crime") +
geom_text(aes(label = ..y..), vjust = -0.6) +
coord_flip()
diff1.mid.borough <- diff1.mid %>%
group_by(borough) %>%
summarise(n_borough = n())
borough.crime <- nyc %>%
group_by(borough) %>%
summarise(n_borough = n())
borough.crime.mid <- cbind(diff1.mid.borough,borough.crime)[,-3]
colnames(borough.crime.mid) <- c("borough","mid.reported","total")
borough.crime.mid <- borough.crime.mid %>%
mutate(mid.percent = mid.reported/total)
half.borough <- borough.crime.mid %>%
ggplot(aes(x = borough, y = mid.percent)) +
geom_bar(stat = "identity") +
ggtitle("10< difference between report time and occurrence time <183 (Borough)") +
xlab("Borough") + ylab("report percent of whole")
# days>=183
diff1.long <- nyc %>%
filter(diff_reported.occurance >= 183) %>%
select(level_of_offense,offense_classification_description,borough,diff_reported.occurance)
whole.level <- diff1.long %>%
group_by(level_of_offense) %>%
summarise(n_level = n()) %>%
ggplot(aes(x = level_of_offense, y = n_level)) +
geom_bar(stat = "identity") +
ggtitle("time between occurrence and reporting >=183 (Level of crime)") +
xlab("Level of offense") + ylab("number of crime") +
geom_text(aes(label = ..y..), vjust = -0.1)
whole.description <- diff1.long %>%
group_by(offense_classification_description) %>%
summarise(n_class = n()) %>%
arrange(desc(n_class)) %>%
head(n = 10) %>%
ggplot(aes(x = reorder(offense_classification_description,-n_class), y = n_class)) +
geom_bar(stat = "identity") +
ggtitle("time between occurrence and reporting >=183 (offense classification - top 10)") +
xlab("offense classification") + ylab("number of crime") +
geom_text(aes(label = ..y..), vjust = -0.6) +
coord_flip()
diff1.long.borough <- diff1.long %>%
group_by(borough) %>%
summarise(n_borough = n())
borough.crime <- nyc %>%
group_by(borough) %>%
summarise(n_borough = n())
borough.crime.long <- cbind(diff1.long.borough,borough.crime)[,-3]
colnames(borough.crime.long) <- c("borough","long.reported","total")
borough.crime.long <- borough.crime.long %>%
mutate(long.percent = long.reported/total)
whole.borough <- borough.crime.long %>%
ggplot(aes(x = borough,y = long.percent)) +
geom_bar(stat = "identity") +
ggtitle("time between occurrence and reporting >=183 (Borough)") +
xlab("Borough") + ylab("report percent of whole ")
# comparison
grid.arrange(half.level,whole.level)
We then analyze the difference in crime classification. Crimes reported 10 to 183 days after occurrence have a similar mix of classifications to crimes reported 183 days or more after occurrence.
grid.arrange(half.description,whole.description)
We also find that in all five boroughs of New York, more than 90% of crimes are reported within 10 days of occurrence.
grid.arrange(within10.borough,half.borough,whole.borough)
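The three reporting-delay groups used above (10 days or fewer, more than 10 but fewer than 183, and 183 or more) can also be assigned in a single step and tabulated by borough. This is a minimal sketch on hypothetical values, not the code used to produce the plots; `toy`, `band` and `band_levels` are illustrative names.

```r
# Hypothetical delays in days between occurrence and report.
toy <- data.frame(
  borough = c("BRONX", "BRONX", "BRONX", "QUEENS", "QUEENS", "QUEENS", "QUEENS"),
  days_to_report = c(0, 3, 200, 1, 10, 45, 183),
  stringsAsFactors = FALSE
)

# Same cut points as the filters in the analysis: <=10, >10 and <183, >=183.
band_levels <- c("<=10", ">10 and <183", ">=183")
toy$band <- factor(
  ifelse(toy$days_to_report <= 10, band_levels[1],
         ifelse(toy$days_to_report < 183, band_levels[2], band_levels[3])),
  levels = band_levels
)

# Counts per borough and band, then each borough's row converted to shares.
counts <- table(toy$borough, toy$band)
shares <- prop.table(counts, margin = 1)
round(shares, 2)
```

Each row of `shares` sums to 1, so the share reported within 10 days can be read per borough without building three separate tables.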
Most crimes are over within one day of when they start.
nyc %>% ggplot(aes(diff_ending.occurance)) +
geom_histogram() +
scale_x_continuous(limits = c(-10,100))
We decide to take a closer look at the offense levels of crimes that last more than 24 hours from occurrence to ending. The plot below shows that there are more misdemeanors than felonies, and many more felonies than violations. Overall, more serious crimes take longer to commit.
diff2 <- nyc %>%
filter(diff_ending.occurance > 24) %>%
select(level_of_offense,offense_classification_description,borough,diff_ending.occurance)
diff2 %>%
group_by(level_of_offense) %>%
summarise(n_level = n()) %>%
ggplot(aes(x = level_of_offense,y = n_level)) + geom_bar(stat = "identity") +
ggtitle("difference between ending and occurrence time >1 day (Level of crime)") +
xlab("Level of offense") + ylab("number of crime") +
geom_text(aes(label = ..y..), vjust = -0.6)
The following bar chart shows the classification of crimes that have more than 24 hours between occurrence and ending.
diff2 %>%
group_by(offense_classification_description) %>% summarise(n_class = n()) %>%
arrange(desc(n_class)) %>%
head(n = 10) %>%
ggplot(aes(x = reorder(offense_classification_description, -n_class), y = n_class)) +
geom_bar(stat = "identity") +
ggtitle("difference between ending and occurrence time >1 day (offense classification - top 10)") +
xlab("offense classification") + ylab("number of crime") +
geom_text(aes(label = ..y..), vjust = -0.6) +
coord_flip()
We also find that among the five boroughs of New York, Brooklyn has the largest number of crimes that last for more than 24 hours.
diff2.borough <- diff2 %>%
group_by(borough) %>%
summarise(n_borough = n())
diff2.borough %>%
ggplot(aes(x = borough,y = n_borough)) +
geom_bar(stat = "identity") +
ggtitle("difference between ending and occurrence time>1 day (Borough)") +
xlab("Borough") + ylab("number of crime") +
geom_text(aes(label = ..y..), vjust = -0.6)
The plot below shows the proportion of races for the five boroughs of New York City.
bca %>%
gather(key,value, white_percent, black_percent, asian_percent, other_or_mixed_percent) %>%
ggplot(aes(x = borough, y = value, fill = key)) +
geom_bar(position = "fill", stat = "identity") +
ggtitle("Proportion of races (Borough)") +
xlab("Borough") + ylab("proportion")
We plot the median age of the five boroughs in the following bar chart. Staten Island has the highest median age among the five.
bca %>%
ggplot(aes(x = borough, y = median_age)) +
geom_bar(stat = "identity")
By plotting the unemployment rate and poverty rate on the same graph, we find that even though Staten Island has the lowest unemployment rate, its poverty rate isn’t the lowest. Queens has the lowest poverty rate.
bca %>%
gather(key,value, unemployment_rate, poverty_rate) %>%
ggplot(aes(x = borough, y = value, col = key)) +
geom_point(size = 5, alpha = 0.8)
When it comes to median household income and median property value, we can see that in general people in Manhattan have higher incomes and more valuable property.
ggplot(bca, aes(x = median_household_income, y = median_property_value, col = borough)) +
geom_point(size = 5)
We compare crimes between the five boroughs of New York City. The plot below shows the number of crimes at each offense level and the proportion of each level in the five boroughs. Felonies make up a higher proportion of crimes in Queens than in the other boroughs, and violations make up a higher proportion in Staten Island.
plot1 <- nyc %>%
ggplot(aes(x = borough, fill = level_of_offense)) +
geom_bar(position = "dodge") +
guides(fill = "none")  # hide the legend here; FALSE is deprecated in recent ggplot2
plot2 <- nyc %>%
ggplot(aes(x = borough, fill = level_of_offense)) +
geom_bar(position = "fill")
grid.arrange(plot1, plot2, ncol = 2)
The bar chart below shows Brooklyn has the largest number of crimes among the five New York boroughs, so we focus on Brooklyn in this section.
borough.crime <- nyc %>%
group_by(borough) %>%
summarise(n_borough = n())
borough.crime %>%
ggplot(aes(x = borough, y = n_borough)) +
geom_bar(stat = "identity") +
ggtitle("Number of crime in different borough") +
xlab("Borough") + ylab("number of crime") +
geom_text(aes(label = ..y..), vjust = -0.6)
Looking at the top 10 crime types in Brooklyn, we find that petit larceny is the most common crime.
brooklyn <- nyc %>%
filter(borough == "BROOKLYN") %>% select(level_of_offense,offense_classification_description,type_of_jurisdiction,type_of_location)
brooklyn %>%
group_by(offense_classification_description) %>%
summarise(n_class = n()) %>%
arrange(desc(n_class)) %>%
head(n = 10) %>%
ggplot(aes(x = reorder(offense_classification_description, -n_class), y = n_class)) +
geom_bar(stat = "identity") +
ggtitle("top 10 crime in brooklyn") +
xlab("offense classification") + ylab("number of crime") +
geom_text(aes(label = ..y..), vjust = -0.6) +
coord_flip()
When it comes to level of offense, the number of misdemeanors in Brooklyn is larger than the number of felonies and violations combined.
brooklyn %>%
group_by(level_of_offense) %>%
summarise(n_level = n()) %>%
ggplot(aes(x = level_of_offense,y = n_level)) +
geom_bar(stat = "identity") +
ggtitle("crime in brooklyn group by different level of crime") +
xlab("Level of offense") + ylab("number of crime") +
geom_text(aes(label = ..y..), vjust = -0.6)
If we look at jurisdiction, we find that most crimes fall under the direct jurisdiction of the New York Police Department, the New York housing police and the New York transit police.
brooklyn %>%
group_by(type_of_jurisdiction) %>%
summarise(n_jurisdiction = n()) %>%
ggplot(aes(x = type_of_jurisdiction, y = n_jurisdiction)) +
geom_bar(stat = "identity") +
ggtitle("crime in brooklyn group by different type of jurisdiction") +
xlab("type of jurisdiction") + ylab("number of crime") +
geom_text(aes(label = ..y..), vjust = -0.6) +
coord_flip()
We plot the frequency of crimes by type of location. It shows that crimes in Brooklyn often happen on the street, in commercial buildings, and in residential houses.
brooklyn %>%
group_by(type_of_location) %>%
summarise(n_location = n()) %>%
arrange(desc(n_location)) %>%
head(n = 6) %>%
ggplot(aes(x = type_of_location, y = n_location)) +
geom_bar(stat = "identity") +
ggtitle("crime in brooklyn group by different type of location") +
xlab("type of location") + ylab("number of crime") +
geom_text(aes(label = ..y..), vjust = -0.6) +
coord_flip()
We plot the top 5 crime types in New York. They are petit larceny, harassment, assault 3 and related offenses, criminal mischief and related offenses, and grand larceny.
nyc %>%
group_by(offense_classification_description) %>%
summarise(n_class = n()) %>%
arrange(desc(n_class)) %>%
head(n = 5) %>%
ggplot(aes(x = reorder(offense_classification_description, -n_class), y = n_class)) +
geom_bar(stat = "identity") +
ggtitle("Top 5 crime type in New York") +
xlab("offense classification") + ylab("number of crime") +
geom_text(aes(label = ..y..), vjust = -0.6) +
coord_flip()
We then color code the chart above with crime offense level. All grand larcenies are felonies.
top5class <- nyc %>%
filter(offense_classification_description %in% c("PETIT LARCENY","HARRASSMENT 2","ASSAULT 3 & RELATED OFFENSES","CRIMINAL MISCHIEF & RELATED OF","GRAND LARCENY")) %>%
select(offense_classification_description,level_of_offense,borough,type_of_location,occurance_month,occurance_day,occurance_weekdays,occurance_hour)
top5class %>%
ggplot(aes(as.factor(x = offense_classification_description), fill = level_of_offense)) +
geom_bar() +
ggtitle("level of offense of the 5 most frequent crimes") +
xlab("type of crime") + ylab("number of crime") +
coord_flip()
Plotting the number of occurrences of the top 5 New York crime types in the five boroughs, we find that Manhattan has a serious petit larceny problem, and Staten Island has very few grand larcenies.
top5class %>%
ggplot(aes(as.factor(x = offense_classification_description), fill = borough)) +
geom_bar(position = "dodge") +
ggtitle("New York Top 5 Crime Types in boroughs") +
xlab("type of crime") + ylab("number of crime") +
coord_flip()
The following table shows the 4 types of location where crimes happen most often, with the number of crimes that occurred there in 2014-2015.
loc <- nyc %>%
group_by(type_of_location) %>%
summarise(n_location = n()) %>%
arrange(desc(n_location)) %>%
head(n = 4)
kable(loc)
| type_of_location | n_location |
|---|---|
| STREET | 295257 |
| RESIDENCE - APT. HOUSE | 207848 |
| RESIDENCE-HOUSE | 88019 |
| RESIDENCE - PUBLIC HOUSING | 72818 |
We create a bar chart for the table above and color code it by the top 5 crime types. Sadly, we find that most harassment 2 and assault 3 crimes happen in residences.
top5class %>%
filter(type_of_location %in% c("STREET","RESIDENCE - APT. HOUSE","RESIDENCE-HOUSE","RESIDENCE - PUBLIC HOUSING")) %>%
ggplot(aes(x = type_of_location, fill = offense_classification_description)) +
geom_bar(position = "dodge") +
ggtitle("Top 5 crime types in the 4 most frequent crime locations") +
xlab("location") + ylab("number of crimes") +
coord_flip()
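Raw counts like those in the chart above are influenced by how many crimes each location type has overall. One way to compare locations on an equal footing is to look at each crime type's share within a location. A minimal sketch with made-up toy data; `toy` and `shares` are illustrative names, not part of the original analysis:

```r
library(dplyr)

# Toy stand-in for `top5class` with only the two columns needed here.
toy <- data.frame(
  type_of_location = c("STREET", "STREET", "STREET", "STREET",
                       "RESIDENCE-HOUSE", "RESIDENCE-HOUSE"),
  offense_classification_description = c(
    "PETIT LARCENY", "PETIT LARCENY", "PETIT LARCENY", "HARRASSMENT 2",
    "HARRASSMENT 2", "HARRASSMENT 2"
  )
)

shares <- toy %>%
  count(type_of_location, offense_classification_description) %>%
  group_by(type_of_location) %>%
  mutate(share = n / sum(n)) %>%  # fraction of that location's crimes
  ungroup()
```

The same comparison can be drawn directly by using `geom_bar(position = "fill")` instead of `position = "dodge"`, which scales every bar to the same height.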
According to the 4 plots below, the top 5 crime types are more likely to happen from May to October, on the first day of each month, on Fridays, and in the afternoon. The pattern is similar to the pattern for all crimes.
top5class %>%
ggplot(aes(x = as.factor(occurance_month), fill = offense_classification_description)) +
geom_bar() +
ggtitle("Monthly New York Crime") +
xlab("Month") + ylab("number of crimes") +
scale_fill_discrete(guide = guide_legend(title = "crime type"))
top5class %>%
ggplot(aes(x = as.factor(occurance_day), fill = offense_classification_description)) +
geom_bar() +
ggtitle("Daily New York Crime") +
xlab("Day") + ylab("number of crimes") +
scale_fill_discrete(guide = guide_legend(title = "crime type"))
top5class %>%
ggplot(aes(x = as.factor(occurance_weekdays), fill = offense_classification_description)) +
geom_bar() +
ggtitle("Weekly New York Crime") +
xlab("Weekdays") + ylab("number of crimes") +
scale_fill_discrete(guide = guide_legend(title = "crime type"))
top5class %>%
ggplot(aes(x = as.factor(occurance_hour), fill = offense_classification_description)) +
geom_bar() +
ggtitle("Hourly New York Crime") +
xlab("hour") + ylab("number of crimes") +
scale_fill_discrete(guide = guide_legend(title = "crime type"))
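The four plots above differ only in the column on the x-axis, the title, and the axis label. One possible way to avoid repeating the block is a small helper. This is a sketch only: `plot_by` and the toy data are hypothetical names introduced here, and the legend title is set with `labs(fill = ...)` rather than the `scale_fill_discrete()` call used above:

```r
library(ggplot2)

# Hypothetical helper: stacked bar chart of crime counts by one time column.
plot_by <- function(data, column, title, x_label) {
  ggplot(data, aes(x = as.factor(.data[[column]]),
                   fill = offense_classification_description)) +
    geom_bar() +
    ggtitle(title) +
    xlab(x_label) + ylab("number of crimes") +
    labs(fill = "crime type")
}

# Toy stand-in for `top5class`.
toy <- data.frame(
  occurance_month = c(1, 1, 5, 7, 7, 7),
  offense_classification_description = c(
    "PETIT LARCENY", "GRAND LARCENY", "PETIT LARCENY",
    "PETIT LARCENY", "HARRASSMENT 2", "GRAND LARCENY"
  )
)

p_month <- plot_by(toy, "occurance_month", "Monthly New York Crime", "Month")
```

With the real data, the same helper could be called once per column, for example `plot_by(top5class, "occurance_hour", "Hourly New York Crime", "hour")`.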
We plot the crimes on the map of New York City. (To reduce loading time, we randomly sample 8000 records for each map.)
The following map shows crimes that occurred in 2014. Each point represents one reported crime.
nyc2014 <- nyc %>%
filter(occurance_year == "2014")
crime.distribution2014 <- sample_n(nyc2014, 8e3) # randomly sample 8000 records
leaflet(data = crime.distribution2014) %>%
addProviderTiles("Esri.NatGeoWorldMap") %>%
addCircleMarkers(~ longitude, ~latitude, radius = 0.005, color = "yellow", fillOpacity = 0.1)
The following map shows crimes that occurred in 2015. Each point represents one reported crime.
nyc2015 <- nyc %>%
filter(occurance_year == "2015")
crime.distribution2015 <- sample_n(nyc2015, 8e3) # randomly sample 8000 records
leaflet(data = crime.distribution2015) %>%
addProviderTiles("Esri.NatGeoWorldMap") %>%
addCircleMarkers(~ longitude, ~latitude, radius = 0.005, color = "blue", fillOpacity = 0.1)
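Because the sample is random, the maps above change on every render, and records with missing coordinates still count toward the 8000. A minimal sketch of a reproducible alternative, assuming the `longitude` and `latitude` columns used above. The helper name `sample_for_map` and the seed value are arbitrary choices made for this example, and `slice_sample()` is used in place of `sample_n()`:

```r
library(dplyr)

# Hypothetical helper: repeatable random sample of rows that can be mapped.
sample_for_map <- function(data, n = 8000, seed = 42) {
  set.seed(seed)  # resets the global random number generator so the sample repeats
  data %>%
    filter(!is.na(longitude), !is.na(latitude)) %>%  # drop unmappable rows first
    slice_sample(n = n)
}

# Toy stand-in for `nyc2014`; two of the five rows lack a coordinate.
toy <- data.frame(
  longitude = c(-73.9, -73.8, NA, -74.0, -73.95),
  latitude  = c(40.7, 40.8, 40.6, NA, 40.75)
)

s1 <- sample_for_map(toy, n = 2)
s2 <- sample_for_map(toy, n = 2)  # same seed, so the same rows as s1
```

Unlike `sample_n()`, `slice_sample()` returns all available rows instead of raising an error when `n` is larger than the data.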
Clicking a numbered cluster on the map below zooms in and shows more detail. When you see location pins, hover over a pin to see the date and time of the crime that occurred there. You can select which year to show in the top right-hand corner of the map.
nyc.s <- sample_n(nyc,8000) # randomly pick 8000 points
nyc.s <- na.omit(nyc.s) # drop rows with missing values
nyc.s.year <- split(nyc.s,nyc.s$occurance_year) # split by year
l <- leaflet() %>% addTiles()
names(nyc.s.year) %>%
purrr::walk( function(year) {
l <<- l %>%
addMarkers(data = nyc.s.year[[year]],
lng = ~longitude, lat = ~latitude,
label = ~as.character(occurance_date_time),
popup = ~as.character(occurance_date_time),
group = year,
clusterOptions = markerClusterOptions(removeOutsideVisibleBounds = FALSE),
labelOptions = labelOptions(noHide = F, direction = 'auto'))
})
l %>%
addLayersControl(
overlayGroups = names(nyc.s.year),
options = layersControlOptions(collapsed = FALSE)
)
Clicking a numbered cluster on the map below zooms in and shows more detail. When you see location pins, hover over a pin to see the type of crime that occurred there. You can select which level of offense to show in the top right-hand corner of the map.
nyc.s <- sample_n(nyc,8000) # randomly pick 8000 points
nyc.s <- na.omit(nyc.s) # drop rows with missing values
nyc.s.level <- split(nyc.s,nyc.s$level_of_offense)
l <- leaflet() %>% addTiles()
names(nyc.s.level) %>%
purrr::walk( function(level) {
l <<- l %>%
addMarkers(data = nyc.s.level[[level]],
lng = ~longitude, lat = ~latitude,
label = ~as.character(offense_classification_description),
popup = ~as.character(offense_classification_description),
group = level,
clusterOptions = markerClusterOptions(removeOutsideVisibleBounds = FALSE),
labelOptions = labelOptions(noHide = F, direction = 'auto'))
})
l %>%
addLayersControl(
overlayGroups = names(nyc.s.level),
options = layersControlOptions(collapsed = FALSE)
)
Clicking a numbered cluster on the map below zooms in and shows more detail. When you see location pins, hover over a pin to see the type of location where the crime occurred. You can select which borough to show in the top right-hand corner of the map.
nyc.s <- sample_n(nyc,8000) # randomly pick 8000 points
nyc.s <- na.omit(nyc.s) # drop rows with missing values
nyc.s.borough <- split(nyc.s,nyc.s$borough)
l <- leaflet() %>% addTiles()
names(nyc.s.borough) %>%
purrr::walk( function(borough) {
l <<- l %>%
addMarkers(data = nyc.s.borough[[borough]],
lng = ~longitude, lat = ~latitude,
label = ~as.character(type_of_location),
popup = ~as.character(type_of_location),
group = borough,
clusterOptions = markerClusterOptions(removeOutsideVisibleBounds = FALSE),
labelOptions = labelOptions(noHide = F, direction = 'auto'))
})
l %>%
addLayersControl(
overlayGroups = names(nyc.s.borough),
options = layersControlOptions(collapsed = FALSE)
)
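The three clustered maps above follow the same recipe: split the sample by one column, add one marker layer per group, then add a layer control. A rough sketch of wrapping that recipe in a function, which also avoids the `<<-` assignment by building the map in an ordinary loop. The name `cluster_map` and the toy data are hypothetical, and the function assumes `longitude` and `latitude` columns as in the data above:

```r
library(leaflet)

# Hypothetical helper: one clustered marker layer per value of `group_col`,
# labelled with the values of `label_col`.
cluster_map <- function(data, group_col, label_col) {
  data <- data[!is.na(data$longitude) & !is.na(data$latitude), ]
  groups <- split(data, data[[group_col]], drop = TRUE)
  m <- addTiles(leaflet())
  for (g in names(groups)) {
    d <- groups[[g]]
    m <- addMarkers(
      m,
      lng = d$longitude, lat = d$latitude,
      label = as.character(d[[label_col]]),
      popup = as.character(d[[label_col]]),
      group = g,
      clusterOptions = markerClusterOptions(removeOutsideVisibleBounds = FALSE)
    )
  }
  addLayersControl(
    m,
    overlayGroups = names(groups),
    options = layersControlOptions(collapsed = FALSE)
  )
}

# Toy stand-in for the sampled `nyc.s` data.
toy <- data.frame(
  longitude = c(-73.9, -73.8, -74.0),
  latitude = c(40.7, 40.8, 40.6),
  borough = c("BRONX", "QUEENS", "BRONX"),
  type_of_location = c("STREET", "STREET", "RESIDENCE-HOUSE")
)

m <- cluster_map(toy, "borough", "type_of_location")
```

With the real sample, the three maps would then correspond to calls such as `cluster_map(nyc.s, "occurance_year", "occurance_date_time")`, `cluster_map(nyc.s, "level_of_offense", "offense_classification_description")`, and `cluster_map(nyc.s, "borough", "type_of_location")`.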
To uncover new information, we classify all of the variables and add new information to the data set. We create multiple tables and charts to help understand the data. Based on our analysis, we report the following findings, which may be useful to most people:
This analysis can be used by individuals and organizations to gain an understanding of the crime situation in New York City.
This analysis is limited by the short time span of the data set and the lack of data mining. With only two years of data, our yearly and monthly analysis may be biased, and we cannot tell whether crime in New York City is trending downward or how it changes over time. We collected demographic information for the five New York City boroughs, but the sample size is too small to conduct the data mining we planned. Combining several decades of crime data with demographic information may yield a data set large enough for data mining on which demographic factors affect crime rates.