1 Introduction

In a city without universal childcare, New York City parents raising children under five must decide for themselves which early childhood care programs are best. The 2020 census reported 532,181 children five years of age or younger in New York City, with 35% of these children living in Brooklyn. It is important that these children have access to early childhood care that is safe and up to code. Each year, New York City’s Department of Health and Mental Hygiene conducts initial inspections and records any violations found. This study investigates whether characteristics such as maximum capacity and facility type are associated with the health inspection violation rates found in Brooklyn, NY.

Based on the violation rates reported in the data set, we presumed that facilities have characteristics that affect their inspection performance. Our first explanatory variable is maximum capacity (numerical), which we use to approximate how many children may have been enrolled on the date of inspection. We suspect that the number of children in a facility could place stress on the staff’s ability to prioritize facility maintenance. Our second explanatory variable is facility type (categorical), with two levels: school-based childcare (SBCC) centers and group day care (GDC) centers. GDCs are facilities located anywhere except private homes that operate for more than 30 days in a 12-month period. SBCCs are facilities that operate in, or as part of, a school. Both facility types serve children who are five years old or younger (but not younger than three at SBCCs).

Unfortunately, the categorical variables in our data set far outnumbered the numerical variables, which limited our choice of numerical explanatory variable. We chose maximum capacity because it serves as a proxy for the number of children enrolled during the school year. Furthermore, the data set contained missing and duplicate values that needed to be removed, which substantially decreased the number of observations.


2 Exploratory data analysis

Our total sample size was 434 childcare facilities in Brooklyn, each belonging to either the SBCC or the GDC facility type. We reduced our data set from 27,828 to 3,994 observations by keeping only GDC and SBCC facilities from Brooklyn and removing duplicates and missing values. For the sake of the research question, facilities with a violation rate of 0 were removed as well. The summary statistics and data visualizations below describe this sample.
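The filtering steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names (`facility_id`, `borough`, `facility_type`, `maximum_capacity`, `violation_rate_percent`) and the sample rows are hypothetical, and the report does not state which software or column names were actually used.

```python
# Hypothetical field names and made-up rows; the real data set may differ.
def clean_inspections(rows):
    """Keep one record per Brooklyn GDC/SBCC facility with a usable, nonzero rate."""
    seen_ids = set()
    cleaned = []
    for row in rows:
        if row.get("borough") != "BROOKLYN":
            continue  # keep Brooklyn only
        if row.get("facility_type") not in {"GDC", "SBCC"}:
            continue  # keep the two facility types under study
        rate = row.get("violation_rate_percent")
        capacity = row.get("maximum_capacity")
        if rate is None or capacity is None:
            continue  # drop rows with missing values
        if rate == 0:
            continue  # facilities with a 0% violation rate were excluded
        if row.get("facility_id") in seen_ids:
            continue  # drop duplicate facilities
        seen_ids.add(row.get("facility_id"))
        cleaned.append(row)
    return cleaned


def make_row(facility_id, borough, facility_type, capacity, rate):
    return {
        "facility_id": facility_id,
        "borough": borough,
        "facility_type": facility_type,
        "maximum_capacity": capacity,
        "violation_rate_percent": rate,
    }


sample_rows = [
    make_row(1, "BROOKLYN", "GDC", 40, 33.3),
    make_row(1, "BROOKLYN", "GDC", 40, 33.3),    # duplicate
    make_row(2, "BROOKLYN", "SBCC", 60, 100.0),
    make_row(3, "QUEENS", "GDC", 25, 50.0),      # other borough
    make_row(4, "BROOKLYN", "GDC", 30, 0.0),     # zero violation rate
    make_row(5, "BROOKLYN", "GDC", None, 50.0),  # missing capacity
    make_row(6, "BROOKLYN", "Camp", 80, 20.0),   # other facility type
]
cleaned = clean_inspections(sample_rows)
print([row["facility_id"] for row in cleaned])
```

The order of the filters does not affect which facilities survive here, but removing duplicates last keeps the set of seen IDs limited to rows that pass every other check.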

The average violation rate was higher for school-based childcare facilities (\(\bar{x}\) = 81.9%) than for GDCs, whose average violation rate was roughly half that (\(\bar{x}\) = 44.1%) (Table 1). This was surprising given that there are far more GDCs (n = 411) than SBCCs (n = 23) in our sample. In line with the higher average, the median violation rate for SBCCs is 100%, 60 percentage points higher than that of GDCs (40%).

Both GDCs and SBCCs have a maximum violation rate of 100%; it is disappointing that some facilities fell so far short of providing safe and up-to-code environments. However, the minimum violation rate for SBCCs was 50%, whereas the minimum for GDCs was 12.5%. In this sample, SBCCs are clearly underperforming compared to GDCs as safe and suitable options for childcare, although the small number of SBCCs (n = 23) should be kept in mind.

Table 1. Summary statistics of the average Violation Rate Percent for School-based and Group day care facilities in Brooklyn, NY.

Facility Type   n     mean_v_rate   med_v_rate   sd_v_rate   min    max
GDC             411   44.13277      40           18.93715    12.5   100
SBCC            23    81.88407      100          20.52186    50.0   100
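For readers who want to reproduce a table like Table 1, the following is a minimal sketch of grouped summary statistics using only Python's standard library. The function name `summarize_by_type` and the toy values are illustrative assumptions, not the study data or the software used for this report.

```python
from statistics import mean, median, stdev


def summarize_by_type(records):
    """Group (facility_type, violation_rate) pairs and summarize each group."""
    groups = {}
    for facility_type, rate in records:
        groups.setdefault(facility_type, []).append(rate)
    summary = {}
    for facility_type, rates in groups.items():
        summary[facility_type] = {
            "n": len(rates),
            "mean": mean(rates),
            "median": median(rates),
            # The sample standard deviation needs at least two observations.
            "sd": stdev(rates) if len(rates) > 1 else float("nan"),
            "min": min(rates),
            "max": max(rates),
        }
    return summary


# Toy values for illustration only.
toy_records = [
    ("GDC", 12.5), ("GDC", 40.0), ("GDC", 100.0),
    ("SBCC", 50.0), ("SBCC", 100.0), ("SBCC", 100.0),
]
summary = summarize_by_type(toy_records)
for facility_type, stats in summary.items():
    print(facility_type, stats)
```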

Figure 1 shows that the most common violation rates were between 30% and 40%, with rates between 50% and 70% being the second most common. The histogram is best described as bimodal (double-peaked). The two peaks may reflect the fact that the initial inspections were conducted over a span of three years.

Figure 2 is a scatter plot of the relationship between our two numerical variables, maximum capacity and violation rate percent. There appears to be no visible correlation between the maximum capacity of facilities and their violation rate percent.
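One way to put a number on the visual impression from a scatter plot like Figure 2 is the Pearson correlation coefficient. The sketch below is a plain-Python illustration with made-up values; the report does not state a correlation coefficient for the actual data, and `pearson_r` is a hypothetical helper name.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Made-up values: the first pair is perfectly linear, the second is uncorrelated.
r_linear = pearson_r([1, 2, 3], [2, 4, 6])
r_none = pearson_r([1, 2, 3, 4], [1, -1, -1, 1])
print(r_linear, r_none)
```

A value near 0 would agree with the absence of a visible linear trend, while values near 1 or -1 would indicate a strong linear relationship.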

Figure 3 illustrates a substantial difference between SBCC and GDC facilities. For SBCCs, the median and the upper quartile are both 100%, whereas GDCs have a median of about 40% (Table 1) and an upper quartile of approximately 62%. The lowest violation rate among SBCCs (50%) is also well above the lowest among GDCs (12.5%).

The final step in our exploratory data analysis was creating a colored scatter plot showing the relationship between our response variable and both of our explanatory variables. In Figure 4, we observe that SBCCs appear to have a steeper slope than GDCs: as maximum capacity increases, violation rate percentage decreases for SBCCs. GDCs, on the other hand, appear to have a nearly flat slope and show little visible change as maximum capacity increases.


3 Multiple regression

3.1 Methods

The components of our multiple linear regression model are the following:

Outcome variable y = Violation Rate Percent for each childcare facility

Numerical explanatory variable x₁ = Maximum Capacity (the maximum number of children that the facility is allowed to enroll)

Categorical explanatory variable x₂ = Facility Type (two levels: GDC = Group Daycare Centers, SBCC = School Based Childcare Centers)

The unit of analysis is an individual childcare facility in Brooklyn, since each row in our data set corresponds to one facility. We did not include an interaction effect in our final parallel-slopes model. Although Figure 4 shows that SBCCs appear to have a steeper slope while GDCs show little change as maximum capacity increases, the interaction term was not statistically significant when tested, so the data do not provide evidence that the two facility types differ in their slopes.

Thus, our regression model examines whether violation rates differ by (1) maximum capacity and (2) facility type.
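To make the parallel-slopes model concrete, here is a minimal least-squares sketch in plain Python. It is an illustration only: the report does not say which software fitted the model, the function name `fit_parallel_slopes` is hypothetical, and the toy data are constructed to lie exactly on a known line rather than taken from the study.

```python
def fit_parallel_slopes(capacity, is_sbcc, y):
    """Least-squares fit of y = b0 + b_cap * capacity + b_sbcc * is_sbcc.

    Solves the normal equations (X'X) b = X'y by Gaussian elimination.
    """
    rows = [[1.0, float(c), float(s)] for c, s in zip(capacity, is_sbcc)]
    k = 3
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    aug = [xtx[i] + [xty[i]] for i in range(k)]  # augmented matrix [X'X | X'y]
    for col in range(k):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, k):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, k + 1):
                aug[r][c] -= factor * aug[col][c]
    coef = [0.0] * k
    for i in reversed(range(k)):  # back substitution
        known = sum(aug[i][j] * coef[j] for j in range(i + 1, k))
        coef[i] = (aug[i][k] - known) / aug[i][i]
    return coef  # [b0, b_cap, b_sbcc]


# Toy data generated from y = 45 - 0.5 * capacity + 38 * is_sbcc (no noise).
toy_capacity = [10, 20, 30, 40, 15, 25]
toy_is_sbcc = [0, 0, 0, 0, 1, 1]
toy_y = [45 - 0.5 * c + 38 * s for c, s in zip(toy_capacity, toy_is_sbcc)]
b0, b_cap, b_sbcc = fit_parallel_slopes(toy_capacity, toy_is_sbcc, toy_y)
print(b0, b_cap, b_sbcc)
```

Coding facility type as a 0/1 indicator with GDC as the baseline is what makes the third coefficient an offset in intercept while both groups share one slope.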

3.2 Model Results

Table 2. Regression table of violation rate as a function of maximum capacity and facility type.

Term                  estimate   std_error   statistic   p_value   lower_ci   upper_ci
intercept             45.141     1.506       29.981      0.000     42.181     48.100
Maximum Capacity      -0.016     0.018       -0.866      0.387     -0.051     0.020
Facility Type: SBCC   38.081     4.165       9.144       0.000     29.895     46.266

3.3 Interpreting the Regression Model

\(\hat{ViolationRate} = b_{0} + b_{cap} \cdot capacity + b_{SBCC} \cdot 1_{SBCC}(x_{2}) = 45.141 - 0.016 \cdot capacity + 38.081 \cdot 1_{SBCC}(x_{2})\)

The intercept (\(b_{0}\) = 45.141) represents the predicted violation rate percent for GDC facilities (the baseline) when the maximum capacity is zero (Table 2).
The estimate for the slope of maximum capacity (\(b_{cap}\) = -0.016) represents the associated change in violation rate percent for each one-unit increase in capacity, holding facility type constant. Based on this estimate, each additional child of capacity is associated with a decrease of 0.016 percentage points in violation rate.
The estimate for Facility Type SBCC (\(b_{SBCC}\) = 38.081) is the offset in intercept relative to the baseline group, GDC (Table 2). On average, SBCC facilities have violation rates 38.081 percentage points higher than GDC facilities of the same maximum capacity. Thus, the two regression lines have the following equations:

GDC facilities (in green): \(\hat{ViolationRate} = 45.141 - 0.016 \cdot capacity\)

SBCC facilities (in blue): \(\hat{ViolationRate} = 83.222 - 0.016 \cdot capacity\)
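As a worked example of how the two regression lines are used, the sketch below plugs the coefficients from Table 2 into the fitted equation. The function name `predict_violation_rate` and the example capacity of 100 are illustrative choices, not part of the original analysis.

```python
# Coefficients taken from Table 2.
B0 = 45.141      # intercept (GDC baseline)
B_CAP = -0.016   # slope for maximum capacity
B_SBCC = 38.081  # offset in intercept for SBCC facilities


def predict_violation_rate(capacity, facility_type):
    """Predicted violation rate percent under the parallel-slopes model."""
    indicator = 1 if facility_type == "SBCC" else 0
    return B0 + B_CAP * capacity + B_SBCC * indicator


gdc_at_100 = predict_violation_rate(100, "GDC")
sbcc_at_100 = predict_violation_rate(100, "SBCC")
print(round(gdc_at_100, 3), round(sbcc_at_100, 3))
```

At any given capacity, the two predictions differ by exactly the SBCC offset of 38.081 percentage points, which is what "parallel slopes" means in practice.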

3.4 Inference for multiple regression

Using the output of our regression table, we test two different null hypotheses. The first null hypothesis is that there is no relationship between maximum capacity and violation rate at the population level (the population slope is zero).

\(H_{0}\) : \(B_{cap}\)= 0 vs \(H_{A}\) : \(B_{cap}\) ≠ 0

The estimated slope for maximum capacity is \(b_{cap}\) = -0.016. However, this relationship does not appear to be statistically meaningful, since in Table 2 we see:

The 95% confidence interval for the population slope \(B_{cap}\) is (-0.051, 0.020), which includes 0.

The p-value is 0.387, which is quite large, so we fail to reject the null hypothesis that \(B_{cap}\) = 0.

So, taking into account potential sampling variation in results (for example, if we collected similar data for a different year or a different borough), the data do not provide convincing evidence of a relationship between capacity and violation rate.

The second null hypothesis that we test is that the difference in intercept for the non-baseline group (SBCC) is zero.

\(H_{0}\) : \(B_{SBCC}\) = 0 vs \(H_{A}\) : \(B_{SBCC}\) ≠ 0

In other words, “is the intercept for GDC equal to the intercept for SBCC or not?” The observed difference in intercept was positive (\(b_{SBCC}\) = 38.081). From Table 2 we observe that:

The 95% confidence interval for the population difference in intercept \(B_{SBCC}\) is (29.895, 46.266), which does not include 0. Thus it is not plausible that the difference in intercepts is zero.

The p-value is very small (p < 0.001), so we reject the null hypothesis that \(B_{SBCC}\) = 0.

So it appears that the difference in intercept between SBCC and GDC is meaningful, and the two facility types have different overall levels of violation rate. This is consistent with our observations from the visualization of the two regression lines in Figure 4.
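The intervals and p-values in Table 2 can be roughly checked by hand from the estimates and standard errors. The sketch below uses a normal approximation from Python's standard library, which is an assumption made here for simplicity: regression software typically uses a t distribution, and Table 2 reports rounded values, so the results should come out close to, but not identical to, the table. The helper names are hypothetical.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()


def approx_interval(estimate, std_error, level=0.95):
    """Approximate confidence interval: estimate +/- z * standard error."""
    z = STANDARD_NORMAL.inv_cdf(0.5 + level / 2)
    return estimate - z * std_error, estimate + z * std_error


def approx_two_sided_p(estimate, std_error):
    """Approximate two-sided p-value for H0: coefficient = 0."""
    z = abs(estimate / std_error)
    return 2 * (1 - STANDARD_NORMAL.cdf(z))


# Estimates and standard errors from Table 2.
cap_ci = approx_interval(-0.016, 0.018)
sbcc_ci = approx_interval(38.081, 4.165)
cap_p = approx_two_sided_p(-0.016, 0.018)
sbcc_p = approx_two_sided_p(38.081, 4.165)

print("capacity:", cap_ci, cap_p)
print("SBCC:", sbcc_ci, sbcc_p)
```

The decision rule is the same one applied in the text: an interval that contains 0 goes with failing to reject the null hypothesis, and an interval that excludes 0 goes with rejecting it.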

4 Conclusion

While we were interested in investigating the effect of maximum capacity on violation rates, we failed to reject our null hypothesis. Our statistical analysis found no statistically significant association between maximum capacity and violation rates during health inspections for childcare facilities in Brooklyn. Whether a facility is licensed for five children or 200, it has the same duty to meet its legal and ethical obligations to the city’s governing bodies and constituents.

On the other hand, school-based childcare facilities were found to perform worse than group day cares in their initial annual health inspections. We reject the null hypothesis in favor of the alternative that SBCC and GDC violation rates differ: our regression model estimates that SBCCs have violation rates about 38 percentage points higher than GDCs, holding maximum capacity constant. Overall, these results suggest that different types of childcare facilities face different challenges in adhering to health codes and standards. One possible explanation, which our data cannot test, is that group day cares are motivated to retain children and parents because the costs of running their facilities depend on enrollment, whereas publicly owned childcare programs like school-based childcare facilities are opted into district funding (though this does not remove them from facing consequences for low performance).

The Department of Health and Mental Hygiene should implement more comprehensive requirements for operating a childcare facility to help reduce violation rates. Measures could include scheduling regular inspections by city health officials to ensure compliance with standards, providing training for teachers and staff on maintaining safe, healthy facilities, and sharing information with community members such as parents and nonprofit organizations that support public safety or access to childcare. These findings are important because many families rely on childcare centers and expect them to provide high-quality care for their children.

4.2 Limitations

As mentioned in the introduction, this study is limited by the categorical variables outnumbering the numerical variables in the data set. This matters because, for our numerical explanatory variable, we had to choose between Maximum Capacity and Total Educational Workers. In hindsight, we realize that the number of staff might have had a greater impact on how these facilities were maintained and how they subsequently performed on their initial annual health inspection.

This study is also limited geographically; although we had data from all five boroughs of New York City, we chose to focus solely on Brooklyn. Therefore, the results of this study are specific to Brooklyn childcare facilities. Another key point to consider is the state or local funding and grants that childcare facilities may apply for. School-based childcare facilities may have an opportunity to improve with the additional $100,000,000 recently allocated to the Child Care Capital Construction Funding Program^{3}, a state-sponsored program that may award funds to approved childcare facilities to help reconstruct their facilities. However, one of the key qualifications is to be found in good standing by the Department of Health and Mental Hygiene.

4.3 Further Questions

If we had the opportunity to continue our research, we would like to investigate how home-based facilities or seasonal camps performed on their violation rates. In addition, the quality of childcare for children in Brooklyn (and for children across New York City) is highly affected by the existence of childcare deserts^{4}. These are areas where there is an extreme lack of childcare options for parents and children. In areas where childcare options are few and far between, socioeconomic status may play a role in the upkeep and inspection of the public buildings, plumbing systems, and ventilation that both school-based and group day care operators must contend with while operating year-round.

Since the results indicate that violation rates are higher among SBCCs in Brooklyn, we would expand our analysis to SBCCs across New York City to understand the distribution of violation rates across violation categories, such as public health violations and critical violations.


5 References

  1. Health. (2016, May 26). DOHMH Childcare Center Inspections. Cityofnewyork.us. https://data.cityofnewyork.us/Health/DOHMH-Childcare-Center-Inspections/dsg6-ifza/about_data
  2. Population of Children Under 5. (2025). Cccnewyork.org. https://data.cccnewyork.org/data/bar/1313/population-of-children-under-5#1313/a/1/1532/99
  3. Child Care Capital Construction Funding Program. (2025). Ny.gov. https://ocfs.ny.gov/programs/childcare/grants/construction-fund/
  4. NYS Child Care Desert Census Tract Map - 2025. (2025). Arcgis.com. https://experience.arcgis.com/experience/ed046537fae14e02a414388b34ca2f8c/page/Map
