NYC Park Crime

df <- read.csv('nyccrime.csv', skip = 4, as.is = TRUE)
head(df[1:8], 10)
##                             PARK         BOROUGH SIZE..ACRES.
## 1                                                            
## 2                PELHAM BAY PARK           BRONX      2771.75
## 3             VAN CORTLANDT PARK           BRONX      1146.43
## 4   ROCKAWAY BEACH AND BOARDWALK          QUEENS      1072.56
## 5                FRESHKILLS PARK   STATEN ISLAND       913.32
## 6   FLUSHING MEADOWS CORONA PARK          QUEENS       897.69
## 7  LATOURETTE PARK & GOLF COURSE   STATEN ISLAND       843.97
## 8                    MARINE PARK        BROOKLYN          798
## 9     BELT PARKWAY/SHORE PARKWAY BROOKLYN/QUEENS       760.43
## 10                    BRONX PARK           BRONX       718.37
##              CATEGORY MURDER RAPE ROBBERY FELONY.ASSAULT
## 1                                                       
## 2  ONE ACRE OR LARGER      0    0       0              0
## 3  ONE ACRE OR LARGER      0    0       0              0
## 4  ONE ACRE OR LARGER      0    0       0              0
## 5  ONE ACRE OR LARGER      0    0       0              0
## 6  ONE ACRE OR LARGER      0    0       4              1
## 7  ONE ACRE OR LARGER      0    0       0              0
## 8  ONE ACRE OR LARGER      0    0       0              0
## 9  ONE ACRE OR LARGER      0    0       0              0
## 10 ONE ACRE OR LARGER      0    0       4              1

Here, the types of crime are spread across multiple columns. There is also a blank first row, extra trailing columns, and some incorrect column names.

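library(tidyverse)
# Drop the blank first row and the last two columns
df <- df[-1, -((ncol(df) - 1):ncol(df))]
# Fix the name of the last remaining crime column
names(df)[ncol(df)] <- 'GRAND.LARCENY.OF.MOTOR.VEHICLE'
# Reshape so each row is one park / crime type pair
df <- pivot_longer(df, 5:ncol(df), names_to = 'crime_type', values_to = 'count')
names(df) <- tolower(names(df))
names(df)[3] <- 'size'
# Make sure size and count are numeric
df[, c(3, 6)] <- sapply(df[, c(3, 6)], as.numeric)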
head(df, 10)
## # A tibble: 10 x 6
##    park            borough  size category         crime_type               count
##    <chr>           <chr>   <dbl> <chr>            <chr>                    <dbl>
##  1 PELHAM BAY PARK BRONX   2772. ONE ACRE OR LAR~ MURDER                       0
##  2 PELHAM BAY PARK BRONX   2772. ONE ACRE OR LAR~ RAPE                         0
##  3 PELHAM BAY PARK BRONX   2772. ONE ACRE OR LAR~ ROBBERY                      0
##  4 PELHAM BAY PARK BRONX   2772. ONE ACRE OR LAR~ FELONY.ASSAULT               0
##  5 PELHAM BAY PARK BRONX   2772. ONE ACRE OR LAR~ BURGLARY                     0
##  6 PELHAM BAY PARK BRONX   2772. ONE ACRE OR LAR~ GRAND.LARCENY                0
##  7 PELHAM BAY PARK BRONX   2772. ONE ACRE OR LAR~ GRAND.LARCENY.OF.MOTOR.~     0
##  8 VAN CORTLANDT ~ BRONX   1146. ONE ACRE OR LAR~ MURDER                       0
##  9 VAN CORTLANDT ~ BRONX   1146. ONE ACRE OR LAR~ RAPE                         0
## 10 VAN CORTLANDT ~ BRONX   1146. ONE ACRE OR LAR~ ROBBERY                      0
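
To make the reshaping step concrete, here is a minimal sketch on a made-up two-park table. The park names and counts are invented for illustration and are not taken from the dataset. It shows how pivot_longer turns one column per crime type into one row per park and crime type.

```r
library(tidyr)

# Hypothetical wide table: one row per park, one column per crime type
toy_wide <- data.frame(
  park    = c("PARK A", "PARK B"),
  MURDER  = c(0, 0),
  ROBBERY = c(4, 1)
)

# Stack the crime columns into (crime_type, count) pairs
toy_long <- pivot_longer(toy_wide, MURDER:ROBBERY,
                         names_to = "crime_type", values_to = "count")
toy_long
```

Each park now appears once per crime type, which is the shape that group_by and summarize expect in the steps that follow.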

With a clean dataset, we can begin the analysis. The first question is whether the size of a park affects the amount of crime in it. I put the x-axis on a log scale because some parks have an area much greater than the median.

parksizecount <- df %>% group_by(park) %>% summarize(size = mean(size, na.rm = TRUE), count = sum(count, na.rm = TRUE))
parksizecount %>% ggplot(aes(log(size), count)) + geom_point() + ylim(0, 25)
## Warning: Removed 4 rows containing missing values (geom_point).
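
One way to complement the scatterplot is a rank correlation, which is unaffected by the log transform. The sketch below uses invented numbers standing in for parksizecount, so it illustrates the check rather than reporting a result from the actual data.

```r
# Hypothetical stand-in for parksizecount (values are made up)
toy <- data.frame(
  size  = c(0.5, 2, 15, 120, 900, 2700),
  count = c(1, 0, 3, 0, 2, 0)
)

# Spearman's rho works on ranks, so size and log(size) give the same answer
rho_raw <- cor(toy$size, toy$count, method = "spearman")
rho_log <- cor(log(toy$size), toy$count, method = "spearman")
c(raw = rho_raw, log = rho_log)
```

On the real data, a value near zero would be consistent with a scatterplot that shows no trend.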

There appears to be no clear relationship. How about by borough? First, let me find the total number of crimes in each borough.

boroughparktotals <- df %>% group_by(park, borough) %>% summarize(count = sum(count, na.rm = TRUE))
counts <- boroughparktotals %>% group_by(borough) %>% summarize(count = sum(count))
counts[3:nrow(counts),] %>% ggplot(aes(reorder(borough, desc(count)), count, fill = borough)) + 
                           geom_bar(stat = 'identity') +
                           theme(axis.text.x = element_text(angle = 90)) + 
                           xlab('Borough') + ylab('Number of Crimes')

Next, let me find the proportion of parks with any crime in each borough.

props <- boroughparktotals %>% group_by(borough) %>% summarize(prop = mean(count > 0))
props[3:nrow(props),] %>% ggplot(aes(reorder(borough, desc(prop)), prop, fill = borough)) + 
          geom_bar(stat = 'identity') + theme(axis.text.x = element_text(angle = 90)) + xlab('Borough') + 
          ylab('Proportion')
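
The mean(count > 0) idiom used for props works because a logical vector is coerced to 0/1 when averaged, so the mean is the share of TRUE values. Here is a small sketch with made-up park totals (the borough labels and counts are hypothetical):

```r
library(dplyr)

# Hypothetical per-park totals: borough X has 4 parks, borough Y has 2
toy_totals <- data.frame(
  borough = c("X", "X", "X", "X", "Y", "Y"),
  count   = c(0, 3, 0, 1, 0, 0)
)

# count > 0 is TRUE for parks with any crime; its mean is the proportion
toy_props <- toy_totals %>%
  group_by(borough) %>%
  summarize(prop = mean(count > 0))
toy_props
```

In this toy table, 2 of the 4 parks in X have any crime and none of the parks in Y do.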

It appears that, of the 5 distinct boroughs, Manhattan has the highest proportion of parks with any crime. However, this could be misleading: Manhattan also tends to have the largest police presence, so crime there is more likely to be reported. The parks on the border of Brooklyn and Queens also show a large proportion, but since there are only 5 parks in this category, the proportion works out to 1/5, which is not terribly significant.
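
To illustrate why a proportion based on only 5 parks is hard to interpret, one option is an exact binomial confidence interval. This sketch uses base R's binom.test with the 1-in-5 figure mentioned above. Treating parks as independent trials is a simplifying assumption on my part.

```r
# 1 park with any crime out of 5 parks in the BROOKLYN/QUEENS category
ci <- binom.test(x = 1, n = 5)$conf.int
ci
```

The resulting 95% interval is wide, so an estimate from this few parks says little about the underlying rate.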

cpp <- boroughparktotals %>% group_by(borough) %>% summarize(mean = mean(count))
cpp[3:nrow(cpp),] %>% ggplot(aes(reorder(borough, desc(mean)), mean, fill = borough)) + 
          geom_bar(stat = 'identity') + theme(axis.text.x = element_text(angle = 90)) + xlab('Borough') +
          ylab('Mean Number of Crimes/Park')

Manhattan also has the highest mean number of crimes per park, although it is still quite small at a measly 0.4 crimes per park. Once again, this is possibly attributable to a larger police presence, and therefore more crimes being caught, rather than to a larger absolute amount of crime.