Introduction
Homelessness is a pressing issue in urban areas, with profound social and economic consequences. In New York City, the Department of Homeless Services (DHS) provides shelter to thousands of families and individuals daily. This analysis focuses on the daily number of adults in families with children residing in DHS shelters, examining how seasonal trends, day-of-week variations, family size, and the proportion of adults in families impact shelter usage. The rising number of families experiencing homelessness highlights the urgent need for data-driven solutions. By analyzing shelter occupancy patterns, we can identify trends and predictors that inform policy decisions.
For example:
• Seasonal Trends: Understanding whether shelter usage peaks during colder months can help allocate resources for winter preparedness programs.
• Family Dynamics: Examining how family size and composition influence shelter occupancy can guide the development of housing programs tailored to larger families or those with more adults.
• Day-of-Week Variations: Identifying whether shelter usage varies by day can optimize staffing and service delivery.
Hypotheses
This report is guided by the following hypotheses:
Seasonal Trends: We hypothesize that shelter occupancy will be higher in winter due to colder weather, which increases the need for shelter among families.
Day-of-Week Variations: We expect that shelter occupancy will remain consistent throughout the week, as homelessness is a chronic issue not tied to specific days.
Family Size and Composition: We predict that larger families and families with a higher proportion of adults will have higher shelter occupancy due to increased housing needs and challenges in finding suitable accommodations.
Research Questions
This report addresses the following research questions:
Is there a seasonal trend in shelter usage?
Does the day of the week affect the number of adults in families with children in shelters?
How do family size and the proportion of adults in families influence shelter occupancy?
Set up and Data Preparation
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 2080110 111.1 4169207 222.7 2704992 144.5
## Vcells 3580498 27.4 8388608 64.0 7792768 59.5
# Load necessary libraries
library(clarify)
library(tidyverse)
library(ggplot2)
library(dplyr)
library(lubridate)
library(AER)
library(janitor)
library(pscl)
# Import the dataset
DHS_report <- read.csv("C:/Users/Shamp/OneDrive/Desktop/Data 712/Dataset for HW6/DHS_Daily_Report_20250320.csv")
## Data Cleaning
# Clean column names
DHS_report <- DHS_report %>%
clean_names()
# Check for missing values
sum(is.na(DHS_report))
## [1] 0
# Convert 'date_of_census' to Date format
DHS_report$date_of_census <- as.Date(DHS_report$date_of_census, format = "%m/%d/%Y")
# Verify the column exists and contains data
str(DHS_report)
## 'data.frame': 1479 obs. of 13 variables:
## $ date_of_census : Date, format: "2023-07-19" "2023-07-20" ...
## $ total_adults_in_shelter : int 53797 53891 53851 53762 53801 53954 53848 53851 53977 53883 ...
## $ total_children_in_shelter : int 28209 28225 28270 28324 28269 28291 28312 28290 28318 28333 ...
## $ total_individuals_in_shelter : int 82006 82116 82121 82086 82070 82245 82160 82141 82295 82216 ...
## $ single_adult_men_in_shelter : int 16905 16943 16962 16891 16971 17050 16952 16967 17047 16999 ...
## $ single_adult_women_in_shelter : int 4784 4794 4752 4713 4710 4753 4754 4751 4772 4682 ...
## $ total_single_adults_in_shelter : int 21689 21737 21714 21604 21681 21803 21706 21718 21819 21681 ...
## $ families_with_children_in_shelter : int 16491 16513 16527 16544 16522 16544 16564 16559 16587 16595 ...
## $ adults_in_families_with_children_in_shelter : int 25913 25940 25953 25970 25948 25979 25996 25976 26020 26038 ...
## $ children_in_families_with_children_in_shelter : int 28209 28225 28270 28324 28269 28291 28312 28290 28318 28333 ...
## $ total_individuals_in_families_with_children_in_shelter: int 54122 54165 54223 54294 54217 54270 54308 54266 54338 54371 ...
## $ adult_families_in_shelter : int 2916 2925 2910 2912 2904 2906 2895 2899 2891 2903 ...
## $ individuals_in_adult_families_in_shelter : int 6195 6214 6184 6188 6172 6172 6146 6157 6138 6164 ...
## NULL
# Derive predictors
DHS_report <- DHS_report %>%
mutate(
day_of_week = weekdays(date_of_census), # Extract day of the week
season = case_when(
month(date_of_census) %in% c(12, 1, 2) ~ "Winter",
month(date_of_census) %in% c(3, 4, 5) ~ "Spring",
month(date_of_census) %in% c(6, 7, 8) ~ "Summer",
month(date_of_census) %in% c(9, 10, 11) ~ "Fall"
),
proportion_adults_in_families = adults_in_families_with_children_in_shelter / total_adults_in_shelter,
family_size = total_individuals_in_families_with_children_in_shelter / families_with_children_in_shelter
)
# View the updated dataset
head(DHS_report)
## date_of_census total_adults_in_shelter total_children_in_shelter
## 1 2023-07-19 53797 28209
## 2 2023-07-20 53891 28225
## 3 2023-07-21 53851 28270
## 4 2023-07-22 53762 28324
## 5 2023-07-23 53801 28269
## 6 2023-07-24 53954 28291
## total_individuals_in_shelter single_adult_men_in_shelter
## 1 82006 16905
## 2 82116 16943
## 3 82121 16962
## 4 82086 16891
## 5 82070 16971
## 6 82245 17050
## single_adult_women_in_shelter total_single_adults_in_shelter
## 1 4784 21689
## 2 4794 21737
## 3 4752 21714
## 4 4713 21604
## 5 4710 21681
## 6 4753 21803
## families_with_children_in_shelter adults_in_families_with_children_in_shelter
## 1 16491 25913
## 2 16513 25940
## 3 16527 25953
## 4 16544 25970
## 5 16522 25948
## 6 16544 25979
## children_in_families_with_children_in_shelter
## 1 28209
## 2 28225
## 3 28270
## 4 28324
## 5 28269
## 6 28291
## total_individuals_in_families_with_children_in_shelter
## 1 54122
## 2 54165
## 3 54223
## 4 54294
## 5 54217
## 6 54270
## adult_families_in_shelter individuals_in_adult_families_in_shelter
## 1 2916 6195
## 2 2925 6214
## 3 2910 6184
## 4 2912 6188
## 5 2904 6172
## 6 2906 6172
## day_of_week season proportion_adults_in_families family_size
## 1 Wednesday Summer 0.4816811 3.281911
## 2 Thursday Summer 0.4813420 3.280143
## 3 Friday Summer 0.4819409 3.280874
## 4 Saturday Summer 0.4830549 3.281794
## 5 Sunday Summer 0.4822959 3.281503
## 6 Monday Summer 0.4815028 3.280343
Data Analysis
In this section, we analyze the distribution of the dependent variable (Adults_in_Families_with_Children_in_Shelter) and explore its relationship with key predictors (Day_of_Week and Season). We also check for overdispersion to determine whether Poisson or Negative Binomial regression is more appropriate for modeling this count data.
# Summary Statistics
# Calculate summary statistics for the dependent variable
summary(DHS_report$Adults_in_Families_with_Children_in_Shelter)
## Length Class Mode
## 0 NULL NULL
# Calculate the mean and variance
mean_count <- mean(DHS_report$adults_in_families_with_children_in_shelter, na.rm = TRUE)
var_count <- var(DHS_report$adults_in_families_with_children_in_shelter, na.rm = TRUE)
# Print the results
cat("Mean:", mean_count, "\nVariance:", var_count, "\n")
## Mean: 20811
## Variance: 62574034
# Interpretation: The mean and variance of the dependent variable are calculated to check for overdispersion. The result shows that the dependent variable, adults_in_families_with_children_in_shelter, has a mean of 20,811 and a variance of 62,574,034. Since the variance is significantly greater than the mean, the data exhibits overdispersion. This means that the variability in the number of adults in families with children in shelters is much higher than what would be expected under a Poisson distribution. For this case a Negative Binomial regression model is more appropriate.
# Relationship with Season
ggplot(DHS_report, aes(x = season, y = adults_in_families_with_children_in_shelter)) +
geom_boxplot(fill = "blue") +
labs(title = "Shelter Occupancy by Season",
x = "Season",
y = "Number of Adults")
# This boxplot compares the number of adults in families with children in shelters across different seasons. It helps us identify whether shelter usage varies significantly by season, which could indicate seasonal trends such as higher usage in winter due to colder weather.
# Relationship Between Family Size and Shelter Occupancy
ggplot(DHS_report, aes(x = family_size, y = adults_in_families_with_children_in_shelter)) +
geom_point(color = "blue", alpha = 0.6) +
geom_smooth(method = "lm", color = "red", se = FALSE) +
labs(title = "Relationship Between Family Size and Shelter Occupancy",
x = "Family Size",
y = "Number of Adults in Families with Children")
## `geom_smooth()` using formula = 'y ~ x'
# Interpretation: The scatterplot shows the relationship between family size and the number of adults in families with children in shelters. Each point represents a data observation, and the linear regression line (Red) indicates the general trend. As family size increases, shelter occupancy also tends to increase, suggesting that larger families are more likely to seek shelter.
# Scatterplot for Proportion of Adults in Families
ggplot(DHS_report, aes(x = proportion_adults_in_families, y = adults_in_families_with_children_in_shelter)) +
geom_point(color = "green", alpha = 0.6) +
geom_smooth(method = "lm", color = "red", se = FALSE) +
labs(title = "Relationship Between Proportion of Adults and Shelter Occupancy",
x = "Proportion of Adults in Families",
y = "Number of Adults in Families with Children")
## `geom_smooth()` using formula = 'y ~ x'
# Interpretation: The scatterplot illustrates the relationship between the proportion of adults in families and the number of adults in families with children in shelters. Each point represents a different observation, showing how the proportion of adults in a family affects the number of adults in shelters. The linear regression line (Red) indicates a strong positive relationship, as the proportion of adults increases, shelter occupancy rises sharply.
# Regression Modeling
# Load necessary libraries
library(MASS) # For Negative Binomial regression
library(car) # For VIF calculation
# Fit Negative Binomial Regression Model
nb_model <- glm.nb(adults_in_families_with_children_in_shelter ~ family_size + proportion_adults_in_families,
data = DHS_report)
# Summarize the model
summary(nb_model)
##
## Call:
## glm.nb(formula = adults_in_families_with_children_in_shelter ~
## family_size + proportion_adults_in_families, data = DHS_report,
## init.theta = 559.3173617, link = log)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.98219 0.07254 27.32 <2e-16 ***
## family_size 2.06972 0.02708 76.44 <2e-16 ***
## proportion_adults_in_families 2.88417 0.03377 85.39 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(559.3174) family taken to be 1)
##
## Null deviance: 131150.7 on 1478 degrees of freedom
## Residual deviance: 1476.6 on 1476 degrees of freedom
## AIC: 24063
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 559.3
## Std. Err.: 21.2
##
## 2 x log-likelihood: -24054.79
## family_size proportion_adults_in_families
## 5.478842 5.478842
# Compare with Poisson Model (for reference)
poisson_model <- glm(adults_in_families_with_children_in_shelter ~ family_size + proportion_adults_in_families,
family = poisson, data = DHS_report)
summary(poisson_model)
##
## Call:
## glm(formula = adults_in_families_with_children_in_shelter ~ family_size +
## proportion_adults_in_families, family = poisson, data = DHS_report)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.16464 0.01208 179.2 <2e-16 ***
## family_size 2.00879 0.00442 454.5 <2e-16 ***
## proportion_adults_in_families 2.91054 0.00517 563.0 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 4642395 on 1478 degrees of freedom
## Residual deviance: 59726 on 1476 degrees of freedom
## AIC: 77036
##
## Number of Fisher Scoring iterations: 3
##
## Overdispersion test
##
## data: poisson_model
## z = 29.745, p-value < 2.2e-16
## alternative hypothesis: true dispersion is greater than 1
## sample estimates:
## dispersion
## 40.59992
# Interpretation: The model coefficients reveal significant insights into shelter occupancy. The intercept (1.98219 on the log scale) indicates that when family size and the proportion of adults are zero, the expected number of adults in shelters is exp(1.98219) ≈ 7.26, serving as a baseline reference. Each additional family member multiplies the expected number of adults in shelters by exp(2.06972) ≈ 7.92, highlighting the greater housing challenges faced by larger families. Similarly, a one-unit increase in the proportion of adults multiplies the expected number of adults in shelters by exp(2.88417) ≈ 17.88, suggesting that families with fewer children and more adults experience higher housing instability. The model fit statistics indicate a good model fit. The residual deviance (1476.6) is significantly lower than the null deviance (131150.7), suggesting that the predictors improve the model. Additionally, the AIC value (24063) is much lower than that of the Poisson model (77036), confirming that the Negative Binomial model is a better fit for the data. The theta value (559.3174) confirms that the data exhibits overdispersion, justifying the use of Negative Binomial regression instead of Poisson regression. While the theta value is large, the significant reduction in deviance and AIC compared to the Poisson model supports the use of the Negative Binomial model. Key findings from the model show that larger families (p < 2e-16) and a higher proportion of adults in families (p < 2e-16) are significant risk factors for homelessness among families with children. These results suggest that housing policies should prioritize support for larger families and those with a higher proportion of adults.
Interpret the Results using Clarify
# Simulate coefficients for the Negative Binomial model
sim_results <- sim(nb_model, n = 1000)
# Define new data for prediction
new_data <- data.frame(
family_size = c(2, 3, 4, 5), # Example family sizes
proportion_adults_in_families = mean(DHS_report$proportion_adults_in_families) # Hold proportion constant
)
# Calculate predicted counts
predicted_counts <- sim_apply(sim_results, FUN = predict, newdata = new_data, type = "response")
# Summarize the predicted counts
summary(predicted_counts)
## Estimate 2.5 % 97.5 %
## 1 1678 1576 1782
## 2 13295 13165 13418
## 3 105333 100862 110129
## 4 834510 758531 920696
# Interpretation of Predicted Counts: Using the clarify package, we simulated predictions from the Negative Binomial model to estimate shelter occupancy for different family sizes while keeping the proportion of adults in families constant. The results confirm that as family size increases, shelter occupancy grows exponentially, highlighting the heightened risk for larger families.
# Check the structure of predicted_counts to identify the correct column names
str(predicted_counts)
## 'clarify_est' num [1:1000, 1:4] 1622 1741 1757 1597 1689 ...
## - attr(*, "dimnames")=List of 2
## ..$ : NULL
## ..$ : chr [1:4] "1" "2" "3" "4"
## - attr(*, "original")= Named num [1:4] 1678 13295 105333 834510
## ..- attr(*, "names")= chr [1:4] "1" "2" "3" "4"
## - attr(*, "sim_hash")= chr "c709fa4ca5ce99e94c4063f4cdab554c"
# Define new data for prediction
new_data <- data.frame(
family_size = c(2, 3, 4, 5), # Example family sizes
proportion_adults_in_families = mean(DHS_report$proportion_adults_in_families) # Hold proportion constant
)
# Calculate predicted counts
predicted_counts <- sim_apply(sim_results, FUN = predict, newdata = new_data, type = "response")
# Calculate mean and confidence intervals manually
predicted_counts_df <- data.frame(
family_size = c(2, 3, 4, 5),
mean = mean(predicted_counts),
lower = apply(predicted_counts, 2, quantile, 0.025),
upper = apply(predicted_counts, 2, quantile, 0.975)
)
# Create a bar plot with the extracted values
ggplot(predicted_counts_df, aes(x = factor(family_size), y = mean)) +
geom_bar(stat = "identity", fill = "blue") +
geom_errorbar(aes(ymin = lower, ymax = upper), width = 0.2) +
labs(title = "Predicted Shelter Occupancy by Family Size",
x = "Family Size",
y = "Predicted Number of Adults") +
theme_minimal()
# Summarize the predicted counts
summary(predicted_counts)
## Estimate 2.5 % 97.5 %
## 1 1678 1576 1782
## 2 13295 13165 13418
## 3 105333 100862 110129
## 4 834510 758531 920696
# The graph illustrates the predicted number of adults in shelters for families of sizes 2, 3, 4, and 5. The predicted number of adults generally increases with family size. However, the prediction for families of size 5 has a significantly wider confidence interval, indicating a higher degree of uncertainty in the model's prediction for this family size. This suggests that the model's predictions for larger families are less precise. Possible reasons for this uncertainty could be limitations in the model's fit or higher variability in the data for larger families.
# Visualize the distribution of the dependent variable
ggplot(DHS_report, aes(x = adults_in_families_with_children_in_shelter)) +
geom_histogram(binwidth = 10, fill = "blue", color = "black") +
labs(title = "Distribution of Adults in Families with Children in Shelter",
x = "Number of Adults",
y = "Frequency")
# This histogram shows the distribution of the number of adults in families with children in shelters. Most families have close to 10,000 adults, as indicated by the high peak. The data is skewed right, with a long tail extending to 30,000 adults. This suggests that while most families have a lower number of adults, there's significant variability, and some families have considerably higher numbers. The right skew indicates the mean number of adults is likely higher than the median. Further investigation is needed to understand the factors causing these higher counts.
# Plot 1: Adults in Families by Day of the Week
ggplot(DHS_report, aes(x = day_of_week, y = adults_in_families_with_children_in_shelter)) +
geom_boxplot(fill = "orange") +
labs(title = "Adults in Families by Day of the Week",
x = "Day of the Week",
y = "Number of Adults")
# The graph shows that the number of adults in families residing in shelters remains stable throughout the week. The boxplots for each day are similar, with median values hovering around the same level. The interquartile ranges (IQR) and whiskers also show comparable spread and overlap, indicating minimal impact of the day of the week on shelter occupancy. There are no significant outliers or extreme variations in the data.
# Plot 2: Adults in Families by Season
ggplot(DHS_report, aes(x = season, y = adults_in_families_with_children_in_shelter)) +
geom_boxplot(fill = "darkcyan") +
labs(title = "Adults in Families by Season",
x = "Season",
y = "Number of Adults")
# The boxplot illustrates that the number of adults in families with children in shelters varies significantly by season. Winter shows the highest median and interquartile range, indicating increased shelter occupancy during colder months. In contrast, summer has the lowest median and IQR, suggesting fewer adults in shelters. This seasonal trend aligns with expectations, as families may seek shelter more during winter due to harsh weather conditions.
Regression Modeling
# Fit Negative Binomial Regression Model
nb_model <- glm.nb(adults_in_families_with_children_in_shelter ~ day_of_week + season + proportion_adults_in_families + family_size,
data = DHS_report)
summary(nb_model)
##
## Call:
## glm.nb(formula = adults_in_families_with_children_in_shelter ~
## day_of_week + season + proportion_adults_in_families + family_size,
## data = DHS_report, init.theta = 601.9278997, link = log)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.0128851 0.0708551 28.408 < 2e-16 ***
## day_of_weekMonday 0.0016483 0.0040305 0.409 0.683
## day_of_weekSaturday -0.0004562 0.0040351 -0.113 0.910
## day_of_weekSunday 0.0004457 0.0040350 0.110 0.912
## day_of_weekThursday 0.0017311 0.0040352 0.429 0.668
## day_of_weekTuesday 0.0020002 0.0040305 0.496 0.620
## day_of_weekWednesday 0.0017935 0.0040353 0.444 0.657
## seasonSpring -0.0011120 0.0031102 -0.358 0.721
## seasonSummer -0.0249934 0.0030808 -8.113 4.95e-16 ***
## seasonWinter 0.0035110 0.0030990 1.133 0.257
## proportion_adults_in_families 2.8860843 0.0334100 86.384 < 2e-16 ***
## family_size 2.0612203 0.0265762 77.559 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(601.9279) family taken to be 1)
##
## Null deviance: 140828.0 on 1478 degrees of freedom
## Residual deviance: 1476.6 on 1467 degrees of freedom
## AIC: 23976
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 601.9
## Std. Err.: 22.8
##
## 2 x log-likelihood: -23949.62
# Use clarify to simulate and interpret results
sim_results <- sim(nb_model, n = 1000)
clarify_summary <- summary(sim_results)
clarify_summary
## Length Class Mode
## sim.coefs 12000 -none- numeric
## coefs 12 -none- numeric
## vcov 144 -none- numeric
## fit 30 negbin list
# Extract quantities of interest
new_data <- data.frame(
day_of_week = "Monday",
season = c("Winter", "Spring", "Summer", "Fall"),
proportion_adults_in_families = mean(DHS_report$proportion_adults_in_families),
family_size = mean(DHS_report$family_size)
)
# Corrected line: Use sim_apply() instead of predict()
predicted_counts <- sim_apply(sim_results, FUN = predict, newdata = new_data, type = "response")
predicted_counts
## A `clarify_est` object (from `sim_apply()`)
## - 1000 simulated values
## - 4 quantities estimated:
## 1 19384.48
## 2 19295.07
## 3 18839.74
## 4 19316.54
# Calculate mean and CI for plotting or further analysis
predicted_counts_df <- data.frame(
season = c("Winter", "Spring", "Summer", "Fall"),
mean = mean(predicted_counts),
lower = apply(predicted_counts, 2, quantile, 0.025),
upper = apply(predicted_counts, 2, quantile, 0.975)
)
predicted_counts_df
## season mean lower upper
## 1 Winter 19207.68 19262.97 19515.90
## 2 Spring 19207.68 19172.17 19419.80
## 3 Summer 19207.68 18715.01 18958.23
## 4 Fall 19207.68 19193.22 19439.04
# Plot the predicted counts by season
library(ggplot2)
ggplot(predicted_counts_df, aes(x = season, y = mean)) +
geom_bar(stat = "identity", fill = "skyblue") +
geom_errorbar(aes(ymin = lower, ymax = upper), width = 0.2) +
labs(title = "Predicted Shelter Occupancy by Season (Holding Monday Constant)",
x = "Season",
y = "Predicted Number of Adults") +
theme_minimal()
# This bar graph displays the predicted number of adults in families with children in shelters across different seasons, specifically for Mondays. The error bars represent the confidence intervals of these predictions. Winter shows the highest predicted adult occupancy, while summer shows the lowest, with fall and spring being relatively close. This aligns with our earlier findings that shelter usage tends to be higher in winter, likely due to colder weather and increased need for shelter. The model's predictions support the hypothesis that seasonal factors significantly influence shelter occupancy.
Conclusion
The analysis of shelter occupancy patterns in NYC’s DHS system highlights key factors influencing family homelessness. Occupancy peaks in winter due to harsh weather, while summer sees a decline. Daily shelter usage remains steady, indicating persistent housing instability. Larger families and those with more adults face higher risks of homelessness, as confirmed by a Negative Binomial regression model.
These findings stress the need for targeted interventions, such as expanding affordable housing for larger households and increasing winter support. Model limitations suggest further research incorporating income and employment data. Understanding these trends enables policymakers to design effective solutions, improving housing stability for vulnerable families.