df <- read.csv('nyccrime.csv', skip = 4, as.is = TRUE)
head(df[1:8], 10)
## PARK BOROUGH SIZE..ACRES.
## 1
## 2 PELHAM BAY PARK BRONX 2771.75
## 3 VAN CORTLANDT PARK BRONX 1146.43
## 4 ROCKAWAY BEACH AND BOARDWALK QUEENS 1072.56
## 5 FRESHKILLS PARK STATEN ISLAND 913.32
## 6 FLUSHING MEADOWS CORONA PARK QUEENS 897.69
## 7 LATOURETTE PARK & GOLF COURSE STATEN ISLAND 843.97
## 8 MARINE PARK BROOKLYN 798
## 9 BELT PARKWAY/SHORE PARKWAY BROOKLYN/QUEENS 760.43
## 10 BRONX PARK BRONX 718.37
## CATEGORY MURDER RAPE ROBBERY FELONY.ASSAULT
## 1
## 2 ONE ACRE OR LARGER 0 0 0 0
## 3 ONE ACRE OR LARGER 0 0 0 0
## 4 ONE ACRE OR LARGER 0 0 0 0
## 5 ONE ACRE OR LARGER 0 0 0 0
## 6 ONE ACRE OR LARGER 0 0 4 1
## 7 ONE ACRE OR LARGER 0 0 0 0
## 8 ONE ACRE OR LARGER 0 0 0 0
## 9 ONE ACRE OR LARGER 0 0 0 0
## 10 ONE ACRE OR LARGER 0 0 4 1
Here, the types of crime are spread across multiple columns. There is also an empty first row, along with extra trailing columns and some incorrect column names.
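library(tidyverse)
# Drop the empty first row and the last two columns
df <- df[-1, -c(ncol(df) - 1, ncol(df))]
# Fix the name of the last remaining crime column
names(df)[ncol(df)] <- 'GRAND.LARCENY.OF.MOTOR.VEHICLE'
# Reshape to one row per park and crime type
df <- pivot_longer(df, 5:ncol(df), names_to = 'crime_type', values_to = 'count')
names(df) <- tolower(names(df))
names(df)[3] <- 'size'
# Make sure size and count are stored as numbers
df <- mutate(df, across(c(size, count), as.numeric))
head(df, 10)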
## # A tibble: 10 x 6
## park borough size category crime_type count
## <chr> <chr> <dbl> <chr> <chr> <dbl>
## 1 PELHAM BAY PARK BRONX 2772. ONE ACRE OR LAR~ MURDER 0
## 2 PELHAM BAY PARK BRONX 2772. ONE ACRE OR LAR~ RAPE 0
## 3 PELHAM BAY PARK BRONX 2772. ONE ACRE OR LAR~ ROBBERY 0
## 4 PELHAM BAY PARK BRONX 2772. ONE ACRE OR LAR~ FELONY.ASSAULT 0
## 5 PELHAM BAY PARK BRONX 2772. ONE ACRE OR LAR~ BURGLARY 0
## 6 PELHAM BAY PARK BRONX 2772. ONE ACRE OR LAR~ GRAND.LARCENY 0
## 7 PELHAM BAY PARK BRONX 2772. ONE ACRE OR LAR~ GRAND.LARCENY.OF.MOTOR.~ 0
## 8 VAN CORTLANDT ~ BRONX 1146. ONE ACRE OR LAR~ MURDER 0
## 9 VAN CORTLANDT ~ BRONX 1146. ONE ACRE OR LAR~ RAPE 0
## 10 VAN CORTLANDT ~ BRONX 1146. ONE ACRE OR LAR~ ROBBERY 0
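To see the reshaping step in isolation, here is a minimal sketch on a made-up two-park table. The `wide` data frame and its values are invented for illustration; only `pivot_longer` and its `names_to` / `values_to` arguments come from the code above.

```r
library(tidyr)

# Made-up wide table: one row per park, one column per crime type.
wide <- data.frame(
  park    = c("PARK A", "PARK B"),
  MURDER  = c(0, 1),
  ROBBERY = c(4, 0)
)

# Stack the crime columns into (crime_type, count) pairs,
# giving one row per park and crime type.
long <- pivot_longer(wide, MURDER:ROBBERY,
                     names_to = "crime_type", values_to = "count")
print(long)
```

Each park now appears once per crime type, which is the shape that `group_by` and `summarize` expect.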
With a clean dataset we can now begin our analysis. My first question: does the size of a park affect the amount of crime in that park? I put the x axis on a log scale because some parks have an area much greater than the median.
parksizecount <- df %>% group_by(park) %>% summarize(size = mean(size, na.rm = TRUE), count = sum(count, na.rm = TRUE))
parksizecount %>% ggplot(aes(log(size), count)) + geom_point() + ylim(0, 25)
## Warning: Removed 4 rows containing missing values (geom_point).
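A scatter plot like this can be backed up with a single number. The sketch below is not part of the original analysis: it assumes a Spearman rank correlation is an acceptable summary, and `size_crime_cor` and the `toy` data are hypothetical names and values used only for illustration. Because Spearman works on ranks, taking the log of size does not change the result.

```r
# Hypothetical helper: Spearman rank correlation between size and count,
# skipping rows with missing or non-positive sizes.
size_crime_cor <- function(d) {
  ok <- is.finite(d$size) & d$size > 0 & is.finite(d$count)
  cor(d$size[ok], d$count[ok], method = "spearman")
}

# Made-up data with the same column names as parksizecount.
toy <- data.frame(size  = c(1, 10, 100, 1000, NA),
                  count = c(0, 2, 1, 3, 5))
rho <- size_crime_cor(toy)
print(rho)
```

Calling `size_crime_cor(parksizecount)` would give the corresponding figure for the scatter plot above.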
It appears there is no relationship. How about by borough? First, let me find the total number of crimes in each borough.
boroughparktotals <- df %>% group_by(park, borough) %>% summarize(count = sum(count, na.rm = TRUE))
counts <- boroughparktotals %>% group_by(borough) %>% summarize(count = sum(count))
counts[3:nrow(counts),] %>% ggplot(aes(reorder(borough, desc(count)), count, fill = borough)) +
geom_bar(stat = 'identity') +
theme(axis.text.x = element_text(angle = 90)) +
xlab('Borough') + ylab('Number of Crimes')
Next, let me find the proportion of parks with any crime in each borough.
props <- boroughparktotals %>% group_by(borough) %>% summarize(prop = mean(count > 0))
props[3:nrow(props),] %>% ggplot(aes(reorder(borough, desc(prop)), prop, fill = borough)) +
geom_bar(stat = 'identity') + theme(axis.text.x = element_text(angle = 90)) + xlab('Borough') +
ylab('Proportion')
It appears that of the 5 distinct boroughs, Manhattan has the highest proportion of parks seeing any crime. However, this has the potential to be misleading, as Manhattan also tends to have the largest police presence, so crime there is more likely to be reported. The parks on the border of Brooklyn and Queens also appear to have a large proportion, but since there are only 5 parks in this category, the proportion ends up being 1/5, which is not terribly significant.
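To make the small-sample caveat concrete, one option is an exact binomial confidence interval for 1 park out of 5. This is a sketch under my own assumption that a Clopper-Pearson interval (what base R's `binom.test` reports) is a reasonable way to express the uncertainty; it was not part of the original analysis.

```r
# 1 park with any crime out of 5 parks (the Brooklyn/Queens figures quoted above).
ci <- binom.test(x = 1, n = 5)$conf.int

# Lower and upper bounds of the 95% interval for the true proportion.
print(ci)
```

The interval is very wide, which is the sense in which a proportion based on 5 parks says little on its own.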
cpp <- boroughparktotals %>% group_by(borough) %>% summarize(mean = mean(count))
cpp[3:nrow(cpp),] %>% ggplot(aes(reorder(borough, desc(mean)), mean, fill = borough)) +
geom_bar(stat = 'identity') + theme(axis.text.x = element_text(angle = 90)) + xlab('Borough') +
ylab('Mean Number of Crimes/Park')
Manhattan also has the highest mean crime per park, although it is still quite small overall at a measly 0.4 crimes per park. Once again, this is possibly attributable to a larger police presence, and therefore more crimes being caught, rather than a larger absolute amount of crime.
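The three borough summaries above (total, proportion, and mean) can also be computed in one pass, together with the number of parks behind each figure. This is a minimal sketch on made-up data: `toy_totals` stands in for `boroughparktotals`, and the column names `parks`, `total`, `prop_any`, and `mean_count` are my own choices.

```r
library(dplyr)

# Made-up per-park totals with the same columns as boroughparktotals.
toy_totals <- data.frame(
  park    = c("A", "B", "C", "D", "E"),
  borough = c("BRONX", "BRONX", "MANHATTAN", "MANHATTAN", "MANHATTAN"),
  count   = c(0, 2, 1, 0, 3)
)

borough_summary <- toy_totals %>%
  group_by(borough) %>%
  summarize(
    parks      = n(),             # number of parks in the borough
    total      = sum(count),      # total crimes
    prop_any   = mean(count > 0), # share of parks with any crime
    mean_count = mean(count)      # mean crimes per park
  )
print(borough_summary)
```

Keeping `parks` next to the other columns makes it easy to spot categories, like the Brooklyn/Queens border, where a proportion or mean rests on only a handful of parks.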