This analysis explores count data regression using a dataset on the Individual Census by Borough, Community District, and Facility Type in NYC. We will: - Load and clean the dataset - Fit a Poisson regression model to predict shelter occupancy - Check for overdispersion - Fit a Negative Binomial regression model for better accuracy - Compute Incidence Rate Ratios (IRRs) for interpretation
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(MASS) # For Negative Binomial Model
##
## Attaching package: 'MASS'
##
## The following object is masked from 'package:dplyr':
##
## select
library(pscl) # For count data regression tools
## Warning: package 'pscl' was built under R version 4.3.3
## Classes and Methods for R originally developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University (2002-2015),
## by and under the direction of Simon Jackman.
## hurdle and zeroinfl functions by Achim Zeileis.
We begin by loading the dataset into R. Ensure the dataset is correctly referenced from its file path before running the following command.
df <- read.csv("C:/Users/marc.ventura/OneDrive - OneWorkplace/Data 765 Python Fundementals/Data 712/HW6/Individual_Census_by_Borough__Community_District__and_Facility_Type_20250324.csv")
Before running our models, we: - Convert Report.Date
to
a date format - Convert Borough
to a categorical variable -
Fill missing values in count columns with 0
(assuming
missing means no shelters of that type in that area) - Create a new
variable Total_Shelter_Count
by summing all shelter
types
# Verify column names
print(colnames(df))
## [1] "Report.Date"
## [2] "Borough"
## [3] "Community.Districts"
## [4] "Census.Type"
## [5] "Adult.Family.Commercial.Hotel"
## [6] "Adult.Family.Shelter"
## [7] "Adult.Shelter"
## [8] "Adult.Shelter.Commercial.Hotel"
## [9] "Family.Cluster"
## [10] "Family.with.Children.Commercial.Hotel"
## [11] "Family.with.Children.Shelter"
# Convert Report Date to Date format
df$Report.Date <- as.Date(df$Report.Date, format="%m/%d/%Y")
df$Borough <- as.factor(df$Borough)
# Fill NA values with 0 for count columns
count_cols <- c("Adult.Family.Commercial.Hotel", "Adult.Family.Shelter", "Adult.Shelter",
"Adult.Shelter.Commercial.Hotel", "Family.Cluster",
"Family.with.Children.Commercial.Hotel", "Family.with.Children.Shelter")
df[count_cols][is.na(df[count_cols])] <- 0
# Create total shelter count column
df$Total_Shelter_Count <- rowSums(df[count_cols])
The Poisson regression is commonly used for count data, assuming the variance equals the mean. We use it to model total shelter occupancy as a function of borough.
poisson_model <- glm(Total_Shelter_Count ~ Borough, family = poisson, data = df)
summary(poisson_model)
##
## Call:
## glm(formula = Total_Shelter_Count ~ Borough, family = poisson,
## data = df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7.295454 0.000857 8512.9 <2e-16 ***
## BoroughBrooklyn -0.409153 0.001213 -337.3 <2e-16 ***
## BoroughManhattan -0.305121 0.001316 -231.9 <2e-16 ***
## BoroughQueens -0.287693 0.001254 -229.3 <2e-16 ***
## BoroughStaten Island -2.259795 0.005374 -420.5 <2e-16 ***
## BoroughWestchester -2.123164 0.016079 -132.0 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 4565158 on 4564 degrees of freedom
## Residual deviance: 4117056 on 4559 degrees of freedom
## AIC: 4151517
##
## Number of Fisher Scoring iterations: 6
Poisson regression assumes that the variance is equal to the mean. If variance is significantly higher than the mean (overdispersion), we need to switch to Negative Binomial regression.
overdispersion_ratio <- var(df$Total_Shelter_Count) / mean(df$Total_Shelter_Count)
overdispersion_ratio
## [1] 1162.975
If the ratio is much greater than 1, overdispersion exists, and the Poisson model is not appropriate.
Since overdispersion is present, we use the Negative Binomial model, which allows variance to exceed the mean.
neg_binom_model <- glm.nb(Total_Shelter_Count ~ Borough, data = df)
summary(neg_binom_model)
##
## Call:
## glm.nb(formula = Total_Shelter_Count ~ Borough, data = df, init.theta = 0.5206170786,
## link = log)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7.29545 0.04560 159.982 < 2e-16 ***
## BoroughBrooklyn -0.40915 0.05887 -6.950 3.66e-12 ***
## BoroughManhattan -0.30512 0.06449 -4.731 2.23e-06 ***
## BoroughQueens -0.28769 0.06215 -4.629 3.67e-06 ***
## BoroughStaten Island -2.25980 0.10209 -22.135 < 2e-16 ***
## BoroughWestchester -2.12316 0.29941 -7.091 1.33e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(0.5206) family taken to be 1)
##
## Null deviance: 5983.7 on 4564 degrees of freedom
## Residual deviance: 5640.5 on 4559 degrees of freedom
## AIC: 70975
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 0.5206
## Std. Err.: 0.0101
##
## 2 x log-likelihood: -70961.2710
IRRs help interpret the coefficients by converting log estimates into multiplicative changes in shelter occupancy rates.
exp(coef(neg_binom_model))
## (Intercept) BoroughBrooklyn BoroughManhattan
## 1473.5854978 0.6642127 0.7370345
## BoroughQueens BoroughStaten Island BoroughWestchester
## 0.7499919 0.1043719 0.1196525