Introduction

This analysis explores count data regression using a dataset on the Individual Census by Borough, Community District, and Facility Type in NYC. We will: - Load and clean the dataset - Fit a Poisson regression model to predict shelter occupancy - Check for overdispersion - Fit a Negative Binomial regression model for better accuracy - Compute Incidence Rate Ratios (IRRs) for interpretation

Load Libraries

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(MASS)  # For Negative Binomial Model

## 
## Attaching package: 'MASS'
## 
## The following object is masked from 'package:dplyr':
## 
##     select

library(pscl)  # For count data regression tools

## Warning: package 'pscl' was built under R version 4.3.3

## Classes and Methods for R originally developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University (2002-2015),
## by and under the direction of Simon Jackman.
## hurdle and zeroinfl functions by Achim Zeileis.

Load Data

We begin by loading the dataset into R. Ensure the dataset is correctly referenced from its file path before running the following command.

df <- read.csv("C:/Users/marc.ventura/OneDrive - OneWorkplace/Data 765 Python Fundementals/Data 712/HW6/Individual_Census_by_Borough__Community_District__and_Facility_Type_20250324.csv")

Data Cleaning & Preprocessing

Before running our models, we: - Convert Report.Date to a date format - Convert Borough to a categorical variable - Fill missing values in count columns with 0 (assuming missing means no shelters of that type in that area) - Create a new variable Total_Shelter_Count by summing all shelter types

# Verify column names
print(colnames(df))

##  [1] "Report.Date"                          
##  [2] "Borough"                              
##  [3] "Community.Districts"                  
##  [4] "Census.Type"                          
##  [5] "Adult.Family.Commercial.Hotel"        
##  [6] "Adult.Family.Shelter"                 
##  [7] "Adult.Shelter"                        
##  [8] "Adult.Shelter.Commercial.Hotel"       
##  [9] "Family.Cluster"                       
## [10] "Family.with.Children.Commercial.Hotel"
## [11] "Family.with.Children.Shelter"

# Convert Report Date to Date format
df$Report.Date <- as.Date(df$Report.Date, format="%m/%d/%Y")
df$Borough <- as.factor(df$Borough)

# Fill NA values with 0 for count columns
count_cols <- c("Adult.Family.Commercial.Hotel", "Adult.Family.Shelter", "Adult.Shelter", 
                "Adult.Shelter.Commercial.Hotel", "Family.Cluster", 
                "Family.with.Children.Commercial.Hotel", "Family.with.Children.Shelter")
df[count_cols][is.na(df[count_cols])] <- 0

# Create total shelter count column
df$Total_Shelter_Count <- rowSums(df[count_cols])

Poisson Regression Model

The Poisson regression is commonly used for count data, assuming the variance equals the mean. We use it to model total shelter occupancy as a function of borough.

poisson_model <- glm(Total_Shelter_Count ~ Borough, family = poisson, data = df)
summary(poisson_model)

## 
## Call:
## glm(formula = Total_Shelter_Count ~ Borough, family = poisson, 
##     data = df)
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           7.295454   0.000857  8512.9   <2e-16 ***
## BoroughBrooklyn      -0.409153   0.001213  -337.3   <2e-16 ***
## BoroughManhattan     -0.305121   0.001316  -231.9   <2e-16 ***
## BoroughQueens        -0.287693   0.001254  -229.3   <2e-16 ***
## BoroughStaten Island -2.259795   0.005374  -420.5   <2e-16 ***
## BoroughWestchester   -2.123164   0.016079  -132.0   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 4565158  on 4564  degrees of freedom
## Residual deviance: 4117056  on 4559  degrees of freedom
## AIC: 4151517
## 
## Number of Fisher Scoring iterations: 6

Overdispersion Check

Poisson regression assumes that the variance is equal to the mean. If variance is significantly higher than the mean (overdispersion), we need to switch to Negative Binomial regression.

overdispersion_ratio <- var(df$Total_Shelter_Count) / mean(df$Total_Shelter_Count)
overdispersion_ratio

## [1] 1162.975

If the ratio is much greater than 1, overdispersion exists, and the Poisson model is not appropriate.

Negative Binomial Regression Model

Since overdispersion is present, we use the Negative Binomial model, which allows variance to exceed the mean.

neg_binom_model <- glm.nb(Total_Shelter_Count ~ Borough, data = df)
summary(neg_binom_model)

## 
## Call:
## glm.nb(formula = Total_Shelter_Count ~ Borough, data = df, init.theta = 0.5206170786, 
##     link = log)
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           7.29545    0.04560 159.982  < 2e-16 ***
## BoroughBrooklyn      -0.40915    0.05887  -6.950 3.66e-12 ***
## BoroughManhattan     -0.30512    0.06449  -4.731 2.23e-06 ***
## BoroughQueens        -0.28769    0.06215  -4.629 3.67e-06 ***
## BoroughStaten Island -2.25980    0.10209 -22.135  < 2e-16 ***
## BoroughWestchester   -2.12316    0.29941  -7.091 1.33e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(0.5206) family taken to be 1)
## 
##     Null deviance: 5983.7  on 4564  degrees of freedom
## Residual deviance: 5640.5  on 4559  degrees of freedom
## AIC: 70975
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  0.5206 
##           Std. Err.:  0.0101 
## 
##  2 x log-likelihood:  -70961.2710

Compute Incidence Rate Ratios (IRRs)

IRRs help interpret the coefficients by converting log estimates into multiplicative changes in shelter occupancy rates.

exp(coef(neg_binom_model))

##          (Intercept)      BoroughBrooklyn     BoroughManhattan 
##         1473.5854978            0.6642127            0.7370345 
##        BoroughQueens BoroughStaten Island   BoroughWestchester 
##            0.7499919            0.1043719            0.1196525

Conclusion

The Negative Binomial model fits better than Poisson due to overdispersion.
Bronx has the highest shelter occupancy, while Staten Island and Westchester have the lowest.
These insights can guide resource allocation and policy decisions for shelter management in NYC.

Homework 6: Count Data Regression Analysis

Marc Brian Ventura

2025-03-24