This analysis explores homelessness trends in NYC from 2009 to 2012 using count data regression models. The primary objective is to investigate factors influencing homelessness estimates across different boroughs using Negative Binomial Regression (NB).
install.packages(“plotly”)
# load libraries
library(clarify)
library(AER)
library(tidyverse)
library(janitor)
library(pscl)
library(MASS)
library(plotly)
library(ggplot2)
# load data
data <- read.csv("C:/Users/susha/Downloads/Directory_Of_Homeless_Population_By_Year_20250325.csv")
# check data
head(data)
## Year Area Homeless.Estimates
## 1 2009 Surface Area - Manhattan 777
## 2 2009 Surface Area - Bronx 164
## 3 2009 Surface Area - Brooklyn 200
## 4 2009 Surface Area - Queens 98
## 5 2009 Surface Area - Staten Island 121
## 6 2009 Surface Total 1360
str(data)
## 'data.frame': 32 obs. of 3 variables:
## $ Year : int 2009 2009 2009 2009 2009 2009 2009 2009 2010 2010 ...
## $ Area : chr "Surface Area - Manhattan " "Surface Area - Bronx " "Surface Area - Brooklyn " "Surface Area - Queens " ...
## $ Homeless.Estimates: int 777 164 200 98 121 1360 968 2328 1145 174 ...
# Check for missing values
sum(is.na(data))
## [1] 0
# Count distribution
ggplot(data, aes(x = Homeless.Estimates)) +
geom_histogram(bins = 10, fill = "blue", alpha = 0.7) +
theme_minimal() +
labs(title = "Distribution of Homeless Estimates")
# Homeless counts by year and area
ggplot(data, aes(x = Year, y = Homeless.Estimates, fill = Area)) +
geom_bar(stat = "identity", position = "dodge") +
theme_minimal() +
labs(title = "Homeless Population by Year and Borough",
x = "Year", y = "Homeless Estimates")
# variance vs mean for overdispersion
var(data$Homeless.Estimates)
## [1] 877293.2
mean(data$Homeless.Estimates)
## [1] 909.1562
# Fit Poisson Model
model_poisson <- glm(Homeless.Estimates ~ Year + Area, family = poisson, data = data)
# Model summary
summary(model_poisson)
##
## Call:
## glm(formula = Homeless.Estimates ~ Year + Area, family = poisson,
## data = data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.263e+02 1.056e+01 -11.96 <2e-16 ***
## Year 6.637e-02 5.254e-03 12.63 <2e-16 ***
## AreaSurface Area - Bronx -2.075e+00 4.254e-02 -48.78 <2e-16 ***
## AreaSurface Area - Brooklyn -1.387e+00 3.177e-02 -43.64 <2e-16 ***
## AreaSurface Area - Manhattan -3.154e-01 2.188e-02 -14.42 <2e-16 ***
## AreaSurface Area - Queens -2.521e+00 5.210e-02 -48.40 <2e-16 ***
## AreaSurface Area - Staten Island -2.248e+00 4.598e-02 -48.90 <2e-16 ***
## AreaSurface Total 2.553e-01 1.893e-02 13.49 <2e-16 ***
## AreaTotal Unsheltered Individuals 8.289e-01 1.703e-02 48.68 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 28240.72 on 31 degrees of freedom
## Residual deviance: 665.81 on 23 degrees of freedom
## AIC: 940.12
##
## Number of Fisher Scoring iterations: 4
# Overdispersion test
dispersiontest(model_poisson)
##
## Overdispersion test
##
## data: model_poisson
## z = 3.6192, p-value = 0.0001478
## alternative hypothesis: true dispersion is greater than 1
## sample estimates:
## dispersion
## 21.4785
Results: Dispersion = 21.4785, p-value < 0.001 → Overdispersion exists. Negative Binomial Regression is preferred over Poisson Regression.
# Fit Negative Binomial Model
model_nb <- glm.nb(Homeless.Estimates ~ Year + Area, data = data)
summary(model_nb)
##
## Call:
## glm.nb(formula = Homeless.Estimates ~ Year + Area, data = data,
## init.theta = 32.99077244, link = log)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -60.21896 58.39879 -1.031 0.3025
## Year 0.03349 0.02905 1.153 0.2489
## AreaSurface Area - Bronx -2.06699 0.13025 -15.869 < 2e-16 ***
## AreaSurface Area - Brooklyn -1.38350 0.12716 -10.880 < 2e-16 ***
## AreaSurface Area - Manhattan -0.30833 0.12504 -2.466 0.0137 *
## AreaSurface Area - Queens -2.51315 0.13367 -18.801 < 2e-16 ***
## AreaSurface Area - Staten Island -2.23854 0.13140 -17.036 < 2e-16 ***
## AreaSurface Total 0.26215 0.12456 2.105 0.0353 *
## AreaTotal Unsheltered Individuals 0.83276 0.12429 6.700 2.08e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(32.9908) family taken to be 1)
##
## Null deviance: 1265.510 on 31 degrees of freedom
## Residual deviance: 32.161 on 23 degrees of freedom
## AIC: 397.92
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 32.99
## Std. Err.: 9.14
##
## 2 x log-likelihood: -377.919
table(data$Homeless.Estimates == 0)
##
## FALSE
## 32
# Predictions
predict(model_nb, type = "response")
## 1 2 3 4 5 6 7
## 859.65465 148.09628 293.34823 94.79435 124.75015 1520.82949 1170.11463
## 8 9 10 11 12 13 14
## 2690.85397 888.93306 153.14020 303.33918 98.02289 128.99894 1572.62643
## 15 16 17 18 19 20 21
## 1209.96680 2782.50000 919.20865 158.35590 313.67041 101.36139 133.39243
## 22 23 24 25 26 27 28
## 1626.18748 1251.17626 2877.26735 950.51537 163.74924 324.35351 104.81359
## 29 30 31 32
## 137.93555 1681.57273 1293.78925 2975.26231
ggplot(data, aes(x = Year, y = Homeless.Estimates, color = Area)) +
geom_point() +
geom_smooth(method = "loess", se = FALSE) +
theme_minimal() +
labs(title = "Trends in Homelessness Estimates by Area (2009-2012)", x = "Year", y = "Homeless Estimates") -> p
ggplotly(p)
## `geom_smooth()` using formula = 'y ~ x'
# Bar plot of the total homeless estimates by area
ggplot(data, aes(x = Area, y = Homeless.Estimates, fill = Area)) +
geom_bar(stat = "identity") +
theme_minimal() +
labs(title = "Total Homeless Estimates by Area", x = "Area", y = "Homeless Estimates") -> p2
p2 <- p2 + theme(axis.text.x = element_text(angle = 45, hjust = 1))
ggplotly(p2)
predicted_counts <- predict(model_nb, type = "response")
data$Predicted <- predicted_counts
ggplot(data, aes(x = Area, y = Predicted, fill = Area)) +
geom_bar(stat = "identity") +
coord_flip() +
theme_minimal() +
labs(title = "Predicted Homelessness Estimates by Borough", x = "Borough", y = "Predicted Homeless Count")
In this analysis, I explored homelessness trends in NYC from 2009 to 2012 using count data regression techniques. Since homelessness estimates are count data, I first tested for overdispersion in a Poisson regression model. The results showed a dispersion value of 21.48 with a p-value < 0.001, confirming that a Negative Binomial regression model was more appropriate. After fitting the model, I found that Year was not statistically significant (p = 0.2489), meaning homelessness estimates did not change significantly over time. However, borough-level differences were significant, with Queens and Staten Island having the lowest estimates compared to Manhattan.
One of the most important findings was that Total Unsheltered Individuals was a strong predictor of homelessness estimates (p < 0.001). This suggests that boroughs with more unsheltered individuals tend to have higher overall homelessness estimates. Additionally, the results indicated that Bronx and Brooklyn had significantly lower estimates than Manhattan, highlighting that homelessness is concentrated in certain areas. These findings directly answer my research questions: homelessness did not change significantly from 2009–2012, Queens and Staten Island had the lowest estimates, and total unsheltered individuals strongly predicted homelessness rates.
Through visualizations, I was able to further explore these patterns, confirming the disparities in homelessness across different boroughs. This analysis made me think more about the role of economic conditions, shelter policies, and external factors in shaping these trends. In the future, I would like to expand this model by incorporating additional variables, such as unemployment rates or housing policies, to gain a deeper understanding of the factors influencing homelessness in NYC.