Introduction

This analysis explores homelessness trends in NYC from 2009 to 2012 using count data regression models. The primary objective is to investigate factors influencing homelessness estimates across different boroughs using Negative Binomial Regression (NB).

Research Questions

  1. Did homelessness estimates change significantly over the years (2009–2012)?
  2. Which borough had the highest and lowest homelessness estimates?
  3. Are total unsheltered individuals a significant predictor of homelessness estimates?

install.packages(“plotly”)

# load libraries
library(clarify)
library(AER)
library(tidyverse)
library(janitor)
library(pscl)
library(MASS)
library(plotly)
library(ggplot2)
# load data
data <- read.csv("C:/Users/susha/Downloads/Directory_Of_Homeless_Population_By_Year_20250325.csv")
# check data
head(data)
##   Year                          Area Homeless.Estimates
## 1 2009     Surface Area - Manhattan                 777
## 2 2009         Surface Area - Bronx                 164
## 3 2009      Surface Area - Brooklyn                 200
## 4 2009        Surface Area - Queens                  98
## 5 2009 Surface Area - Staten Island                 121
## 6 2009                Surface Total                1360
str(data)
## 'data.frame':    32 obs. of  3 variables:
##  $ Year              : int  2009 2009 2009 2009 2009 2009 2009 2009 2010 2010 ...
##  $ Area              : chr  "Surface Area - Manhattan " "Surface Area - Bronx " "Surface Area - Brooklyn " "Surface Area - Queens " ...
##  $ Homeless.Estimates: int  777 164 200 98 121 1360 968 2328 1145 174 ...
# Check for missing values
sum(is.na(data))
## [1] 0
# Count distribution
ggplot(data, aes(x = Homeless.Estimates)) + 
  geom_histogram(bins = 10, fill = "blue", alpha = 0.7) + 
  theme_minimal() +
  labs(title = "Distribution of Homeless Estimates")

# Homeless counts by year and area
ggplot(data, aes(x = Year, y = Homeless.Estimates, fill = Area)) +
  geom_bar(stat = "identity", position = "dodge") +
  theme_minimal() +
  labs(title = "Homeless Population by Year and Borough",
       x = "Year", y = "Homeless Estimates")

# variance vs mean for overdispersion
var(data$Homeless.Estimates)
## [1] 877293.2
mean(data$Homeless.Estimates)
## [1] 909.1562
# Fit Poisson Model
model_poisson <- glm(Homeless.Estimates ~ Year + Area, family = poisson, data = data)

# Model summary
summary(model_poisson)
## 
## Call:
## glm(formula = Homeless.Estimates ~ Year + Area, family = poisson, 
##     data = data)
## 
## Coefficients:
##                                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                        -1.263e+02  1.056e+01  -11.96   <2e-16 ***
## Year                                6.637e-02  5.254e-03   12.63   <2e-16 ***
## AreaSurface Area - Bronx           -2.075e+00  4.254e-02  -48.78   <2e-16 ***
## AreaSurface Area - Brooklyn        -1.387e+00  3.177e-02  -43.64   <2e-16 ***
## AreaSurface Area - Manhattan       -3.154e-01  2.188e-02  -14.42   <2e-16 ***
## AreaSurface Area - Queens          -2.521e+00  5.210e-02  -48.40   <2e-16 ***
## AreaSurface Area - Staten Island   -2.248e+00  4.598e-02  -48.90   <2e-16 ***
## AreaSurface Total                   2.553e-01  1.893e-02   13.49   <2e-16 ***
## AreaTotal Unsheltered Individuals   8.289e-01  1.703e-02   48.68   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 28240.72  on 31  degrees of freedom
## Residual deviance:   665.81  on 23  degrees of freedom
## AIC: 940.12
## 
## Number of Fisher Scoring iterations: 4
# Overdispersion test
dispersiontest(model_poisson)
## 
##  Overdispersion test
## 
## data:  model_poisson
## z = 3.6192, p-value = 0.0001478
## alternative hypothesis: true dispersion is greater than 1
## sample estimates:
## dispersion 
##    21.4785

Results: Dispersion = 21.4785, p-value < 0.001 → Overdispersion exists. Negative Binomial Regression is preferred over Poisson Regression.

# Fit Negative Binomial Model

model_nb <- glm.nb(Homeless.Estimates ~ Year + Area, data = data)
summary(model_nb)
## 
## Call:
## glm.nb(formula = Homeless.Estimates ~ Year + Area, data = data, 
##     init.theta = 32.99077244, link = log)
## 
## Coefficients:
##                                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                        -60.21896   58.39879  -1.031   0.3025    
## Year                                 0.03349    0.02905   1.153   0.2489    
## AreaSurface Area - Bronx            -2.06699    0.13025 -15.869  < 2e-16 ***
## AreaSurface Area - Brooklyn         -1.38350    0.12716 -10.880  < 2e-16 ***
## AreaSurface Area - Manhattan        -0.30833    0.12504  -2.466   0.0137 *  
## AreaSurface Area - Queens           -2.51315    0.13367 -18.801  < 2e-16 ***
## AreaSurface Area - Staten Island    -2.23854    0.13140 -17.036  < 2e-16 ***
## AreaSurface Total                    0.26215    0.12456   2.105   0.0353 *  
## AreaTotal Unsheltered Individuals    0.83276    0.12429   6.700 2.08e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(32.9908) family taken to be 1)
## 
##     Null deviance: 1265.510  on 31  degrees of freedom
## Residual deviance:   32.161  on 23  degrees of freedom
## AIC: 397.92
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  32.99 
##           Std. Err.:  9.14 
## 
##  2 x log-likelihood:  -377.919
table(data$Homeless.Estimates == 0)
## 
## FALSE 
##    32
# Predictions
predict(model_nb, type = "response")
##          1          2          3          4          5          6          7 
##  859.65465  148.09628  293.34823   94.79435  124.75015 1520.82949 1170.11463 
##          8          9         10         11         12         13         14 
## 2690.85397  888.93306  153.14020  303.33918   98.02289  128.99894 1572.62643 
##         15         16         17         18         19         20         21 
## 1209.96680 2782.50000  919.20865  158.35590  313.67041  101.36139  133.39243 
##         22         23         24         25         26         27         28 
## 1626.18748 1251.17626 2877.26735  950.51537  163.74924  324.35351  104.81359 
##         29         30         31         32 
##  137.93555 1681.57273 1293.78925 2975.26231

Visualizations

ggplot(data, aes(x = Year, y = Homeless.Estimates, color = Area)) +
  geom_point() +
  geom_smooth(method = "loess", se = FALSE) +
  theme_minimal() +
  labs(title = "Trends in Homelessness Estimates by Area (2009-2012)", x = "Year", y = "Homeless Estimates") -> p
ggplotly(p)
## `geom_smooth()` using formula = 'y ~ x'
# Bar plot of the total homeless estimates by area
ggplot(data, aes(x = Area, y = Homeless.Estimates, fill = Area)) +
  geom_bar(stat = "identity") +
  theme_minimal() +
  labs(title = "Total Homeless Estimates by Area", x = "Area", y = "Homeless Estimates") -> p2
p2 <- p2 + theme(axis.text.x = element_text(angle = 45, hjust = 1))
ggplotly(p2)

Predicted Homeless by Borough

predicted_counts <- predict(model_nb, type = "response")
data$Predicted <- predicted_counts
ggplot(data, aes(x = Area, y = Predicted, fill = Area)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  theme_minimal() +
  labs(title = "Predicted Homelessness Estimates by Borough", x = "Borough", y = "Predicted Homeless Count")

Conclusion

In this analysis, I explored homelessness trends in NYC from 2009 to 2012 using count data regression techniques. Since homelessness estimates are count data, I first tested for overdispersion in a Poisson regression model. The results showed a dispersion value of 21.48 with a p-value < 0.001, confirming that a Negative Binomial regression model was more appropriate. After fitting the model, I found that Year was not statistically significant (p = 0.2489), meaning homelessness estimates did not change significantly over time. However, borough-level differences were significant, with Queens and Staten Island having the lowest estimates compared to Manhattan.

One of the most important findings was that Total Unsheltered Individuals was a strong predictor of homelessness estimates (p < 0.001). This suggests that boroughs with more unsheltered individuals tend to have higher overall homelessness estimates. Additionally, the results indicated that Bronx and Brooklyn had significantly lower estimates than Manhattan, highlighting that homelessness is concentrated in certain areas. These findings directly answer my research questions: homelessness did not change significantly from 2009–2012, Queens and Staten Island had the lowest estimates, and total unsheltered individuals strongly predicted homelessness rates.

Through visualizations, I was able to further explore these patterns, confirming the disparities in homelessness across different boroughs. This analysis made me think more about the role of economic conditions, shelter policies, and external factors in shaping these trends. In the future, I would like to expand this model by incorporating additional variables, such as unemployment rates or housing policies, to gain a deeper understanding of the factors influencing homelessness in NYC.