Homework #6

2025-03-24

Load Libraries

library(readr)
library(dplyr)
library(clarify)
library(AER)
library(MASS)
library(ggplot2)
library(pscl)

Start Session

Import Data

park_crime_data <- read_csv("nyc_park_crime_stats_q2_2023.csv")
park_crime_data <- rename(park_crime_data, "Park_Size_Acres"  = `Park Size(Acres)`, "Total_Crimes" = `Total # of Crimes`)

Description of the Data

  • The data report the number of crimes recorded in New York City parks during the second quarter of 2023 (April 1, 2023 to June 30, 2023).

Structure of the Data

glimpse(park_crime_data)
## Rows: 1,154
## Columns: 13
## $ ...1                       <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, …
## $ PARK                       <chr> "PELHAM BAY PARK", "VAN CORTLANDT PARK", "R…
## $ Borough                    <chr> "BRONX", "BRONX", "QUEENS", "STATEN ISLAND"…
## $ Park_Size_Acres            <dbl> 2771.747, 1146.430, 1072.564, 913.320, 897.…
## $ Category                   <chr> "ONE ACRE OR LARGER", "ONE ACRE OR LARGER",…
## $ `# of Murders`             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ `# of Rapes`               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ `# of Robberies`           <dbl> 0, 0, 0, 0, 7, 0, 0, 0, 4, 0, 0, 1, 2, 0, 0…
## $ `# of Felony Assaults`     <dbl> 0, 0, 0, 0, 13, 0, 0, 0, 1, 0, 0, 4, 0, 0, …
## $ `# of Burglaries`          <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0…
## $ `# of Grand Larcenies`     <dbl> 0, 0, 1, 1, 46, 0, 2, 1, 0, 0, 1, 5, 1, 0, …
## $ `# of Grand Larceny Autos` <dbl> 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Total_Crimes               <dbl> 0, 0, 1, 1, 71, 0, 2, 1, 6, 0, 1, 10, 3, 0,…

Data Table

DT::datatable(park_crime_data)

Methods

Research Question

  • Is there a relationship between park size in acres (Park_Size_Acres) and the total number of crimes that occur in a park (Total_Crimes)?

Hypothesis

  • Null Hypothesis: There is no relationship between park size in acres (Park_Size_Acres) and the total number of crimes that occur (Total_Crimes).
  • Alternative Hypothesis: There is a relationship between park size in acres (Park_Size_Acres) and the total number of crimes that occur (Total_Crimes).

Steps

  • I will fit a Poisson regression model to determine whether there is a relationship between park size (Park_Size_Acres) and the number of crimes that occur (Total_Crimes).

  • I will then test for overdispersion, that is, whether the variance of the dependent variable (Total_Crimes) is greater than its mean. The Poisson model assumes the two are equal.

  • If the variance is greater than the mean, I will use a negative binomial regression instead.

  • I will also check whether the data contain many zeros. If they do, I will use a zero-inflated negative binomial regression model.

Analysis

Perform Poisson Regression Model

model1 <- glm(Total_Crimes ~ Park_Size_Acres, family = poisson, data = park_crime_data)
summary(model1)
## 
## Call:
## glm(formula = Total_Crimes ~ Park_Size_Acres, family = poisson, 
##     data = park_crime_data)
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     -1.357e+00  5.743e-02  -23.64   <2e-16 ***
## Park_Size_Acres  1.688e-03  7.104e-05   23.77   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 2146.0  on 1153  degrees of freedom
## Residual deviance: 1928.4  on 1152  degrees of freedom
## AIC: 2228.3
## 
## Number of Fisher Scoring iterations: 14

Test for Overdispersion (Is the variance equal to the mean?)

dispersiontest(model1)
## 
##  Overdispersion test
## 
## data:  model1
## z = 1.7689, p-value = 0.03845
## alternative hypothesis: true dispersion is greater than 1
## sample estimates:
## dispersion 
##   9.295107

Results

  • The variance is not equal to the mean.
  • The p-value (0.038) is < .05, so we reject the null hypothesis that the variance equals the mean (dispersion = 1) in favor of the alternative hypothesis that the variance is greater than the mean.
  • The dispersion estimate is 9.30, well above 1, indicating that the variance is much greater than the mean (overdispersion).
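
As a rough descriptive companion to the formal test, the raw mean and variance of a count variable can be compared directly. The sketch below is only an illustration: `mean_var_ratio` is a hypothetical helper, and `toy_counts` is simulated data, not the park data. Unlike `dispersiontest()`, it ignores the covariates in the model.

```r
# Hypothetical helper: mean, variance, and variance-to-mean ratio of a count vector.
# A ratio well above 1 is a descriptive hint of overdispersion.
mean_var_ratio <- function(counts) {
  c(mean     = mean(counts),
    variance = var(counts),
    ratio    = var(counts) / mean(counts))
}

# Simulated overdispersed counts (assumption: negative binomial toy data)
set.seed(1)
toy_counts <- rnbinom(1000, size = 0.2, mu = 0.5)

mean_var_ratio(toy_counts)
```

The same helper could be applied to `park_crime_data$Total_Crimes` once the data are loaded.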

Are there many zeros in the data?

ggplot(park_crime_data, aes(x = Total_Crimes)) + geom_histogram()

The histogram shows a large number of zeros in the data, so I will fit a zero-inflated negative binomial regression model.
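
The visual check can be backed by a number: the observed share of zeros compared with the share a Poisson distribution with the same mean would produce. This is a minimal sketch under stated assumptions; `zero_share` is a hypothetical helper, the toy vector is made up, and the comparison ignores covariates.

```r
# Hypothetical helper: observed proportion of zeros versus the proportion
# expected under a Poisson distribution with the same mean, exp(-mean).
zero_share <- function(counts) {
  c(observed         = mean(counts == 0),
    poisson_expected = exp(-mean(counts)))
}

# Made-up toy counts for illustration only
toy_counts <- c(0, 0, 0, 0, 0, 0, 1, 0, 3, 12)
zero_share(toy_counts)
```

An observed share far above the Poisson-expected share points toward excess zeros. The helper could be called on `park_crime_data$Total_Crimes` in the same way.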

Zero-Inflated Negative Binomial Regression Model

model2 <- zeroinfl(Total_Crimes ~ Park_Size_Acres, data = park_crime_data, dist = "negbin")
options(scipen=999) #Remove scientific notation
summary(model2)
## 
## Call:
## zeroinfl(formula = Total_Crimes ~ Park_Size_Acres, data = park_crime_data, 
##     dist = "negbin")
## 
## Pearson residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4125 -0.2716 -0.1965 -0.1703 28.5466 
## 
## Count model coefficients (negbin with log link):
##                   Estimate Std. Error z value             Pr(>|z|)    
## (Intercept)     -1.0349547  0.1644396  -6.294        0.00000000031 ***
## Park_Size_Acres  0.0036002  0.0007632   4.717        0.00000239280 ***
## Log(theta)      -1.7708600  0.1739763 -10.179 < 0.0000000000000002 ***
## 
## Zero-inflation model coefficients (binomial with logit link):
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       1.3502     0.3761   3.590 0.000331 ***
## Park_Size_Acres  -0.5973     0.2214  -2.697 0.006992 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Theta = 0.1702 
## Number of iterations in BFGS optimization: 32 
## Log-likelihood: -537.3 on 5 Df

Interpretation

(Count Model Coefficients)

  • The p-value for Park_Size_Acres is < .05, so there is a significant relationship between Park_Size_Acres and the number of crimes that occur (Total_Crimes).

  • The estimate for Park_Size_Acres is 0.0036 on the log scale. Exponentiating gives exp(0.0036) ≈ 1.0036, so each additional acre multiplies the expected number of crimes by about 1.0036 (an increase of roughly 0.36%) among parks that are not excess zeros.

  • Park size significantly affects the number of crimes that occur. As park size increases, the number of crimes increases.

  • Theta = 0.1702 (log(theta) = -1.77). In the negative binomial model the variance is mu + mu^2/theta, so a theta this small implies substantial overdispersion, which supports using the negative binomial model rather than the Poisson model.
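
The arithmetic behind the count-model interpretation can be reproduced from the printed coefficients. The values below are copied from the `summary(model2)` output; the variable names and the choice of `mu = 1` are illustrative assumptions.

```r
# Coefficients copied from the count model output above
b_size    <- 0.0036002   # Park_Size_Acres, log scale
log_theta <- -1.7708600  # Log(theta)

# Incidence rate ratio: multiplicative change in expected crimes per extra acre
irr         <- exp(b_size)
pct_per_acre <- (irr - 1) * 100

# The same effect expressed per 100 acres (assumption: 100 is just a convenient unit)
irr_100 <- exp(100 * b_size)

# Theta on its natural scale, and the implied negative binomial variance
# at an illustrative mean of mu = 1: Var = mu + mu^2 / theta
theta  <- exp(log_theta)
mu     <- 1
nb_var <- mu + mu^2 / theta

c(irr = irr, pct_per_acre = pct_per_acre, irr_100 = irr_100,
  theta = theta, nb_var_at_mu_1 = nb_var)
```

With a fitted model in the session, `exp(coef(model2))` should give the exponentiated coefficients directly.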

(Zero-Inflation Model Coefficients)

  • The coefficient for Park_Size_Acres has a p-value of .006992, which is < .05, so there is a significant relationship between Park_Size_Acres and the probability that a park is an excess zero (a park that belongs to the always-zero group).

  • The estimate for Park_Size_Acres is -0.5973. Each additional acre multiplies the odds of being an excess zero by exp(-0.5973) ≈ 0.55, a decrease of about 45%. Larger parks are therefore less likely to have zero crimes (more likely to have some crimes).
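
To make the zero-inflation coefficients concrete, the logit-scale estimates can be converted to an odds ratio and to predicted probabilities of being an excess zero. The coefficients are copied from the output above; the park sizes in `acres` are arbitrary illustrative values.

```r
# Zero-inflation coefficients copied from the model output above
zi_intercept <- 1.3502
zi_size      <- -0.5973

# Odds ratio per additional acre, and the implied percent change in the odds
odds_ratio <- exp(zi_size)
pct_change <- (odds_ratio - 1) * 100

# Predicted probability of being an excess zero at a few illustrative sizes
acres  <- c(0, 1, 5, 10)
p_zero <- plogis(zi_intercept + zi_size * acres)

data.frame(acres = acres, p_excess_zero = round(p_zero, 3))
```

Under these estimates the excess-zero probability falls quickly with size, so the zero-inflation part mainly distinguishes very small parks from the rest.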

Average Marginal Effects

set.seed(123)
sim_coefs4 <- sim(model2)
sim_est4 <- sim_ame(sim_coefs4, var = "Park_Size_Acres",
                    contrast = "rd")
summary(sim_est4)
##                          Estimate  2.5 % 97.5 %
## E[dY/d(Park_Size_Acres)]   0.0606 0.0202 1.9369

Interpretation

  • The 95% confidence interval runs from 0.0202 to 1.9369. Since zero does not fall within the interval, we can conclude that there is a statistically significant relationship between park size and the number of crimes that occur. The interval is wide and strongly asymmetric around the estimate, so the size of the effect is estimated imprecisely.

  • The average marginal effect of Park_Size_Acres is 0.0606. This means that, averaged over the parks in the data, one additional acre of park size increases the expected number of crimes by about 0.0606.
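
What an average marginal effect measures can be illustrated by hand: nudge the predictor by a tiny amount for every observation, take the change in the predicted count, and average. The sketch below uses a simple Poisson model on simulated data (`toy`, `size`, and `crimes` are made-up names and values), not the zero-inflated model or the simulation-based intervals that `clarify` produces.

```r
# Simulated toy data (assumption: a plain Poisson relationship)
set.seed(42)
toy <- data.frame(size = runif(200, 0, 50))
toy$crimes <- rpois(200, lambda = exp(-1 + 0.03 * toy$size))

fit <- glm(crimes ~ size, family = poisson, data = toy)

# Finite-difference average marginal effect:
# shift every observation's size by eps and average the slope of the prediction
eps <- 1e-5
shifted <- toy
shifted$size <- toy$size + eps

pred_base    <- predict(fit, newdata = toy,     type = "response")
pred_shifted <- predict(fit, newdata = shifted, type = "response")
ame <- mean((pred_shifted - pred_base) / eps)

# For a log-link Poisson model this matches coefficient * mean fitted value
ame_closed_form <- unname(coef(fit)["size"]) * mean(fitted(fit))

c(finite_difference = ame, closed_form = ame_closed_form)
```

The estimate is on the count scale (crimes per extra unit of size), which is why it reads differently from the log-scale coefficient.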

Dose-response relationship: Prediction and plot

sim_est4b <- sim_adrf(sim_coefs4, var = "Park_Size_Acres",
                      contrast = "adrf")
plot(sim_est4b)

Interpretation

  • The shaded area represents the 95% confidence band around the estimated average number of crimes at each park size.

  • The solid line remains relatively flat from 0 acres up to around 2,300 acres, indicating no noticeable increase in crime as size increases. From about 2,400 to 2,700 acres, the expected number of crimes begins to rise slightly.

  • Overall, larger parks are associated with more crimes.
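
The idea behind an average dose-response function can be sketched by hand: for each value on a grid, set every observation's predictor to that value, predict, and average the predictions. The example below is a simplified illustration on simulated data with a plain Poisson model; `toy`, `urban`, and `size_grid` are hypothetical, and no confidence band is computed.

```r
# Simulated toy data with a second covariate so the averaging is non-trivial
set.seed(7)
n <- 300
toy <- data.frame(size = runif(n, 0, 50), urban = rbinom(n, 1, 0.5))
toy$crimes <- rpois(n, lambda = exp(-1 + 0.03 * toy$size + 0.5 * toy$urban))

fit <- glm(crimes ~ size + urban, family = poisson, data = toy)

# Grid of sizes at which to evaluate the average predicted count
size_grid <- seq(0, 50, by = 10)

adrf <- sapply(size_grid, function(s) {
  counterfactual <- toy
  counterfactual$size <- s          # everyone gets the same size
  mean(predict(fit, newdata = counterfactual, type = "response"))
})

data.frame(size = size_grid, mean_predicted = adrf)
```

Plotting `mean_predicted` against `size` gives the solid line of a dose-response plot; the shaded band in the document's plot comes from repeating this over simulated coefficients.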