NYC Open Data provides numerous datasets about the city “as part of an initiative to improve the accessibility, transparency, and accountability of City government.” My presentation focuses on how public data can help ordinary citizens better understand, and potentially improve, the quality of life in New York City. While my analysis centers on two existing datasets and a relationship between them, it focuses just as much on how future data collection can be improved to better serve that goal.
Many NYC Open Data datasets, such as 311 service request logs, provide valuable information for policymakers, administrators, or individuals with substantial financial or political power. However, these datasets are often difficult for ordinary residents to act upon. The majority of New Yorkers, for example, do not have the capacity to meaningfully influence the housing market.
That said, certain types of information (i) can be directly acted upon by individuals and (ii) can be translated into concrete, low-barrier actions. The field of positive psychology, which consistently finds that strong social relationships are the most reliable predictors of well-being, provides one framework for identifying this information. Considering this research, one might ask: Can publicly available data be used to explore the conditions that best facilitate social connectedness and thereby most enhance quality of life?
The answer, at the moment, is a tentative yes. At present, NYC Open Data does not include the validated measures psychologists typically use to assess metrics like social connectedness and well-being. Instead, researchers and citizens must rely on rough proxies — such as economic metrics. However, over time, the number of resources amenable to the type of analysis I propose can be expanded.
In this exploratory analysis, I examine whether the number of permitted events in a community district (a rough measure of social connectedness) predicts the number of SNAP recipients per month (a rough measure of economic health and, thereby, overall well-being). Rather than treat these variables as definitive measures, I use them as an opportunity to demonstrate how fruitful this mode of research can be. I also conclude with a number of suggestions for how data collection in this field can best be facilitated.
library(tidyverse)
library(nycOpenData)
library(dplyr)
library(stringr)
First, I loaded records of NYC permitted events and NYC borough community reports using the NYC Open Data package that my professor (Christian Martinez) created.
Events <- nyc_permit_events_historic(limit = 10000, filters = list())
BoroReport <- nyc_borough_community_report(limit = 10000, filters = list())
glimpse(Events)
## Rows: 10,000
## Columns: 11
## $ event_agency <chr> "43, ", "43, ", "43, ", "43, ", "43, ", "43, ", "N…
## $ event_id <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "21124", N…
## $ event_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "Ganando A…
## $ start_date_time <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ end_date_time <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ event_type <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ event_borough <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ event_location <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ street_closure_type <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ community_board <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ police_precinct <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
glimpse(BoroReport)
## Rows: 1,711
## Columns: 9
## $ month <chr> "2025-09-01T00:00:00.000", "2025-09-01T00:00:00.…
## $ borough <chr> "Staten_Island", "Staten_Island", "Staten_Island…
## $ community_district <chr> "S03", "S02", "S01", "Q14", "Q13", "Q12", "Q11",…
## $ bc_snap_recipients <chr> "14469", "19401", "41291", "32329", "24177", "53…
## $ bc_snap_households <chr> "9014", "11080", "22547", "18077", "15506", "326…
## $ bc_ca_recipients <chr> "3561", "4635", "16929", "14228", "8303", "23653…
## $ bc_ca_cases <chr> "2058", "2612", "8311", "7051", "4954", "12689",…
## $ bc_ma_only_enrollees <chr> "7510", "9603", "12480", "10278", "13676", "2225…
## $ bc_total_ma_enrollees <chr> "13710", "18324", "37025", "31230", "25712", "54…
After this, I removed all non-numeric characters from the community board listings in the events data.
eventscleaner <- Events %>%
  mutate(
    # Strip everything except digits, then convert the result to numeric
    community_board = community_board %>%
      str_replace_all("[^0-9]", "") %>%
      as.numeric()
  )
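To see what the digit-stripping step does, the sketch below applies it to a few made-up strings. The values in `raw_cb` are hypothetical (patterned on the trailing-comma format visible in the glimpse output), and whether the events data actually contains entries that list more than one board is an assumption. If it does, stripping all non-digits would fuse the numbers together, whereas extracting the first run of digits would keep only the first board.

```r
library(stringr)

# Hypothetical raw values; the real column may look different
raw_cb <- c("64, ", "8, ", "1, 2, ", NA)

# Approach used above: remove every non-digit, then convert
strip_all <- as.numeric(str_replace_all(raw_cb, "[^0-9]", ""))

# Alternative: keep only the first run of digits
first_only <- as.numeric(str_extract(raw_cb, "[0-9]+"))

print(data.frame(raw_cb, strip_all, first_only))
```

In this toy input, `"1, 2, "` becomes 12 under the first approach and 1 under the second; which is preferable depends on how multi-board events should be counted.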
In the borough report, I separated the community district field into a borough identifier and a numeric community board. I then recoded boroughs as numeric prefixes and combined these with the community board numbers to create a standardized community district ID.
BoroReport <- BoroReport %>%
  mutate(
    snap_borough = str_extract(community_district, "^[A-Za-z]") %>% str_to_upper(),
    snap_cb = str_extract(community_district, "[0-9]+") %>% as.numeric()
  ) %>%
  mutate(
    snap_borough_num = case_when(
      snap_borough == "M" ~ 100, # Manhattan
      snap_borough == "B" ~ 200, # Bronx
      snap_borough == "K" ~ 300, # Brooklyn
      snap_borough == "Q" ~ 400, # Queens
      snap_borough == "S" ~ 500, # Staten Island
      TRUE ~ NA_real_
    ),
    cd_id = snap_borough_num + snap_cb
  )
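As a quick check of the ID scheme, the sketch below applies the same logic to a handful of district codes, using a named lookup vector in place of `case_when()`. The vector name `borough_prefix` and the sample codes are illustrative and not part of the original analysis.

```r
library(dplyr)
library(stringr)

# Illustrative lookup: borough letter -> numeric prefix (same mapping as above)
borough_prefix <- c(M = 100, B = 200, K = 300, Q = 400, S = 500)

# Sample codes in the letter-plus-digits format seen in glimpse(BoroReport)
demo <- data.frame(community_district = c("S03", "Q14", "M01", "K12", "B05"))

demo <- demo %>%
  mutate(
    letter = str_to_upper(str_extract(community_district, "^[A-Za-z]")),
    board  = as.numeric(str_extract(community_district, "[0-9]+")),
    cd_id  = unname(borough_prefix[letter]) + board
  )

print(demo)
```

A lookup vector keeps the mapping in one place, which may be convenient if the same prefixes are reused for the events data.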
Finally, I applied this same numbering pattern to the events data. I recoded the borough names as numeric prefixes and added them to the community board numbers.
eventscleaner <- eventscleaner %>%
  mutate(
    borough_num = case_when(
      event_borough == "Manhattan" ~ 100,
      event_borough == "Bronx" ~ 200,
      event_borough == "Brooklyn" ~ 300,
      event_borough == "Queens" ~ 400,
      # Accept both spellings of Staten Island
      event_borough %in% c("Staten_Island", "Staten Island") ~ 500,
      TRUE ~ NA_real_
    ),
    cd_id = borough_num + community_board
  )
After this, I counted the number of events per community district to get a better understanding of the data.
events_cd <- eventscleaner %>%
count(cd_id, name = "n_events")
print(events_cd)
## # A tibble: 32 × 2
## cd_id n_events
## <dbl> <int>
## 1 101 2
## 2 107 6
## 3 108 277
## 4 109 12
## 5 111 96
## 6 164 235
## 7 211 11
## 8 228 213
## 9 301 16
## 10 302 13
## # ℹ 22 more rows
I then created a graph to display the number of events per district, in descending order. The first digit of each ID identifies the borough: 1 is Manhattan, 2 is the Bronx, 3 is Brooklyn, 4 is Queens, and 5 is Staten Island. The last two digits of each label on the Y axis give the district number. NYC has 59 community districts: 12 in Manhattan, 12 in the Bronx, 18 in Brooklyn, 14 in Queens, and 3 on Staten Island. Numbers that do not fit into this schema (such as 55 and 64) refer to “joint-interest areas.” 55, for instance, is Prospect Park, and 64 is Central Park. A full list can be seen here.
ggplot(events_cd, aes(x = reorder(cd_id, n_events), y = n_events)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Number of Events by Community District",
    x = "Community District",
    y = "Number of Events"
  ) +
  theme_minimal()
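The numbering scheme can also be checked in code. This sketch builds the 59 standard district IDs from the borough counts given in the text and flags any ID outside that set as a joint-interest area (or a missing value). The names `standard_cd` and `ids` are illustrative; the sample IDs are taken from the `events_cd` output shown earlier, plus an `NA`.

```r
# Standard community district IDs: borough prefix + district number
standard_cd <- c(101:112, 201:212, 301:318, 401:414, 501:503)
length(standard_cd)  # 59

# A few IDs from the events_cd output, plus a missing value
ids <- c(101, 108, 164, 228, 301, NA)

# FALSE marks joint-interest areas and missing IDs
flagged <- data.frame(cd_id = ids, standard = ids %in% standard_cd)
print(flagged)
```

A flag like this could be used to separate parks and other joint-interest areas from ordinary districts before plotting or modeling.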
I also looked at the number of SNAP recipients per district. The table below shows, for each community district, the number of SNAP recipients in the first monthly record returned for that district (September 2025 for the rows shown in the glimpse above). (Note: Community districts do not necessarily have equal populations, so the number of SNAP recipients in a district does not by itself indicate the proportion of residents in poverty. That said, it still provides a meaningful snapshot of economic need.)
BoroReport <- BoroReport %>%
  mutate(
    bc_snap_recipients = as.numeric(bc_snap_recipients)
  )
snap_table <- BoroReport %>%
  group_by(cd_id) %>%
  summarise(
    # first() keeps the first row listed for each district, not an average
    bc_snap_recipients = first(bc_snap_recipients),
    .groups = "drop"
  ) %>%
  arrange(desc(bc_snap_recipients)) %>%
  rename(
    `Community District` = cd_id,
    `SNAP Recipients` = bc_snap_recipients
  )
knitr::kable(
  snap_table,
  caption = "SNAP Recipients by NYC Community District"
)
| Community District | SNAP Recipients |
|---|---|
| 312 | 72954 |
| 305 | 66887 |
| 204 | 58144 |
| 209 | 57656 |
| 205 | 55868 |
| 412 | 53866 |
| 301 | 53694 |
| 207 | 50067 |
| 112 | 47829 |
| 201 | 46014 |
| 303 | 45496 |
| 212 | 44754 |
| 203 | 43642 |
| 311 | 42019 |
| 111 | 41971 |
| 501 | 41291 |
| 206 | 39878 |
| 407 | 38431 |
| 316 | 37297 |
| 103 | 36631 |
| 317 | 35480 |
| 314 | 34969 |
| 313 | 34151 |
| 315 | 33309 |
| 318 | 32478 |
| 414 | 32329 |
| 110 | 30823 |
| 211 | 28792 |
| 404 | 28606 |
| 401 | 26630 |
| 409 | 26116 |
| 403 | 25425 |
| 413 | 24177 |
| 307 | 23112 |
| 408 | 23079 |
| 202 | 22794 |
| 210 | 22567 |
| 304 | 22447 |
| 309 | 22223 |
| 109 | 22012 |
| 405 | 20634 |
| 308 | 20201 |
| 310 | 20058 |
| 208 | 19702 |
| 410 | 19401 |
| 502 | 19401 |
| 107 | 17042 |
| 503 | 14469 |
| 302 | 14330 |
| 406 | 14111 |
| 104 | 14004 |
| 402 | 13205 |
| 411 | 10421 |
| 306 | 10011 |
| 106 | 7464 |
| 108 | 6913 |
| 105 | 5531 |
| 102 | 2893 |
| 101 | 2193 |
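Because the borough report appears to hold one row per district per month, `first()` keeps whichever month happens to be listed first. A more explicit alternative, sketched here on made-up numbers, is to select the latest month directly or to average across months. The data frame `toy` and its values are hypothetical.

```r
library(dplyr)

# Hypothetical two-month extract for two districts
toy <- data.frame(
  month = c("2025-09-01T00:00:00.000", "2025-08-01T00:00:00.000",
            "2025-09-01T00:00:00.000", "2025-08-01T00:00:00.000"),
  cd_id = c(503, 503, 502, 502),
  bc_snap_recipients = c(100, 120, 200, 180)
)

# Option 1: keep only the most recent month
# (ISO-8601 timestamps sort chronologically as plain strings)
latest <- toy %>%
  filter(month == max(month)) %>%
  select(cd_id, bc_snap_recipients)

# Option 2: average over all available months
monthly_mean <- toy %>%
  group_by(cd_id) %>%
  summarise(mean_snap = mean(bc_snap_recipients), .groups = "drop")

print(latest)
print(monthly_mean)
```

Either option makes the table's meaning independent of the order in which the API returns rows.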
Finally, I merged the two datasets using the community district IDs I created earlier.
merged <- BoroReport %>%
left_join(events_cd, by = "cd_id")
glimpse(merged)
## Rows: 1,711
## Columns: 14
## $ month <chr> "2025-09-01T00:00:00.000", "2025-09-01T00:00:00.…
## $ borough <chr> "Staten_Island", "Staten_Island", "Staten_Island…
## $ community_district <chr> "S03", "S02", "S01", "Q14", "Q13", "Q12", "Q11",…
## $ bc_snap_recipients <dbl> 14469, 19401, 41291, 32329, 24177, 53866, 10421,…
## $ bc_snap_households <chr> "9014", "11080", "22547", "18077", "15506", "326…
## $ bc_ca_recipients <chr> "3561", "4635", "16929", "14228", "8303", "23653…
## $ bc_ca_cases <chr> "2058", "2612", "8311", "7051", "4954", "12689",…
## $ bc_ma_only_enrollees <chr> "7510", "9603", "12480", "10278", "13676", "2225…
## $ bc_total_ma_enrollees <chr> "13710", "18324", "37025", "31230", "25712", "54…
## $ snap_borough <chr> "S", "S", "S", "Q", "Q", "Q", "Q", "Q", "Q", "Q"…
## $ snap_cb <dbl> 3, 2, 1, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3…
## $ snap_borough_num <dbl> 500, 500, 500, 400, 400, 400, 400, 400, 400, 400…
## $ cd_id <dbl> 503, 502, 501, 414, 413, 412, 411, 410, 409, 408…
## $ n_events <int> NA, NA, NA, NA, 22, 84, 392, NA, NA, 747, 151, N…
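Two quick diagnostics can make the effect of the join visible: `anti_join()` lists event areas that have no matching SNAP rows, and counting `NA`s shows how many SNAP rows lack an event count. The sketch uses small made-up tables (`snap_toy`, `events_toy`) rather than the real data.

```r
library(dplyr)

# Hypothetical stand-ins for BoroReport and events_cd
snap_toy <- data.frame(cd_id = c(101, 102, 108), bc_snap_recipients = c(10, 20, 30))
events_toy <- data.frame(cd_id = c(108, 164), n_events = c(5L, 7L))

# Same join direction as above: keep every SNAP row
merged_toy <- left_join(snap_toy, events_toy, by = "cd_id")

# Event areas with no SNAP rows (here the joint-interest ID 164)
unmatched_events <- anti_join(events_toy, snap_toy, by = "cd_id")

# SNAP rows that received no event count
n_missing <- sum(is.na(merged_toy$n_events))

print(unmatched_events)
print(n_missing)
```

Whether a missing event count should be treated as zero or as unknown depends on whether the event sample is complete, which the 10,000-record limit leaves uncertain.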
I then conducted a linear regression to determine whether the number of permitted events predicts the number of SNAP recipients. The model was statistically significant, F(1, 723) = 45.34, p < .001, and explained approximately 6% of the variance in SNAP recipients (R² = .059). The number of events was a significant negative predictor of SNAP recipients, b = −21.30, SE = 3.16, t(723) = −6.73, p < .001.
(Note: The model dropped all rows with missing values. Because the join starts from the borough report, which contains only the 59 standard community districts, the joint-interest areas never entered the model. The 986 dropped rows are district-months with no matching event count in the 10,000-record sample.)
model1 <- lm(bc_snap_recipients ~ n_events, data = merged)
summary(model1)
##
## Call:
## lm(formula = bc_snap_recipients ~ n_events, data = merged)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28081 -12534 -2646 8391 44572
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30043.800 715.843 41.970 < 2e-16 ***
## n_events -21.303 3.164 -6.733 3.38e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16210 on 723 degrees of freedom
## (986 observations deleted due to missingness)
## Multiple R-squared: 0.059, Adjusted R-squared: 0.0577
## F-statistic: 45.34 on 1 and 723 DF, p-value: 3.385e-11
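One way to probe how much the repeated monthly rows matter is to collapse the data to one row per district before fitting. The sketch below uses simulated data (25 districts by 29 months, chosen so that the row count matches the 725 observations in the model above) purely to show the mechanics. The simulated coefficients are not estimates of anything, and all object names are illustrative.

```r
library(dplyr)

set.seed(1)
n_districts <- 25
n_months <- 29

# One simulated event count per district
districts <- data.frame(
  cd_id = seq_len(n_districts),
  n_events = rpois(n_districts, lambda = 50)
)

# Repeat each district once per month and simulate a monthly outcome
panel <- expand.grid(cd_id = districts$cd_id, month = seq_len(n_months)) %>%
  left_join(districts, by = "cd_id") %>%
  mutate(snap = 30000 - 20 * n_events + rnorm(n(), sd = 5000))

# Row-level model: every district-month is treated as independent
row_model <- lm(snap ~ n_events, data = panel)

# District-level model: average over months first
district_means <- panel %>%
  group_by(cd_id) %>%
  summarise(snap = mean(snap), n_events = first(n_events), .groups = "drop")
district_model <- lm(snap ~ n_events, data = district_means)

df.residual(row_model)       # 723
df.residual(district_model)  # 23
```

The district-level fit has far fewer residual degrees of freedom, so its standard errors reflect the number of districts, not the number of district-months. A mixed-effects model with a district-level random intercept would be another option.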
Despite the significant p-value, this analysis has a number of limitations. As mentioned in the introduction, the number of SNAP recipients is an imperfect measure of economic well-being (not to mention holistic well-being). Likewise, permitted events are an imperfect indicator of social gatherings in an area. At a more granular level, community districts are not normalized by population size, and major hubs of social activity, such as parks, are excluded from the regression. In addition, the model treats each district-month as a separate observation even though the event count is fixed within a district, so the 723 residual degrees of freedom overstate the amount of independent information.
However, these limitations point to ways in which data collection could be improved. Below, I outline several possibilities for making such improvements:
Better Dependent Variables
To meaningfully assess quality of life in NYC, future datasets should include more varied indicators of well-being and capture outcomes across the income distribution. Ideally, validated population-level measures of well-being and social connectedness would be available for use as dependent variables. In addition, economic proxies for well-being (such as median income) should be collected. Diverse datasets of this sort would provide a more complete picture of the psychological and economic well-being of NYC residents.
More Information about Social Gatherings
Currently, NYC Open Data has information about permitted events. Yet, there are countless other social gatherings that could be quantified as well. These include volunteer opportunities, Meetup groups, Eventbrite activities, Reddit meetups, and more. While an exhaustive catalog of social gatherings is not feasible, expanded coverage of accessible, low-barrier events would strengthen any analyses of social life in the city. It would also allow analysts to subdivide events in meaningful ways.
Geographic Information
Community districts provide a useful organizational unit, but many NYC datasets lack this data. In addition, even more detailed neighborhood-level data on events might provide information about areas with a shortage (or surplus) of social activity. Identifying such areas might support more strategic intervention. Finally, knowledge of individuals’ willingness (or lack of willingness) to travel might provide yet more valuable information. The prominence of parks in the event data suggests that social life is often organized around specific hubs. The practical accessibility of these hubs is yet another concept worth exploring.
Concrete Suggestions
There is no “control New York City.” As such, causality cannot be established through the analyses I describe. Nevertheless, if evidence were to suggest that certain types of social activities were associated with positive psychological outcomes, it would then be possible to recommend concrete actions to citizens who wished to improve civic and social life in New York. In this way, improved data infrastructure could help foster a stronger sense of civic autonomy among New Yorkers – as well as a happier, healthier New York City.