Project 3

Author

Myriam O.

Project 3 Assignment

NYC Pool Inspections

Introduction

My project focuses on pool inspections across the boroughs of New York City, using the NYC Open Data Pool Inspection dataset. The dataset includes pool names and locations, community board, council district, census tract, borough, permit type, inspection type, Public Health Hazard (PHH) violations, critical violations, general violations, and geographic coordinates.

To prepare the data, I changed the borough abbreviations into full borough names and converted the inspection date into a proper date format, from which I extracted the year. I renamed the violation variables, selected only the variables needed for my analysis and visualizations, and removed rows with missing values.

I chose this topic because I did not know that pools were regularly inspected in NYC, so I became curious about how pool inspections work and why they are important. I also wanted to learn more about the types of violations found during inspections and how they can affect public safety and health.

Variables:

Variable Type	Variable Name	Description
Categorical	Borough	Name of the NYC borough where the pool inspection took place
Categorical	Permit_Type	Type of permit for the pool facility (indoor or outdoor)
Categorical	Inspection_Type	Type of inspection performed
Categorical	Facility_Name	Name of the pool facility being inspected
Categorical	year	Year when the inspection took place
Numerical	Num_PHH_Violations	Number of Public Health Hazard (PHH) violations found during the inspection (originally `# of PHH Violations`)
Numerical	Num_Critical_Violations	Number of critical violations found during the inspection (originally `# of Critical Violations`)

Load libraries

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.2.0
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(leaflet)
library(ggridges)

Read data

setwd("~/Downloads/First data 110 assignment_files")
pool <- read_csv("NYC_Pool_Inspections.csv")

Rows: 5842 Columns: 23
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (10): Permit_Type, Facility_Name, ADDRESS_ No, ADDRESS_St, BO, Inspectio...
dbl (13): ACCELA, ZIP, # of All Violations, # of PHH Violations, # of Critic...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

CLEANING

1. Use borough full name

pool <- pool |>
  mutate(Borough = case_when(
    BO == "MA" ~ "Manhattan",
    BO == "BX" ~ "Bronx",
    BO == "BK" ~ "Brooklyn",
    BO == "QU" ~ "Queens",
    BO == "SI" ~ "Staten Island",
    TRUE ~ BO
  ))

I used mutate() with case_when() to change the borough abbreviations into full borough names, which makes the dataset easier to read and understand.

2. Change date to year

pool2 <- pool |>
  mutate(date = mdy(Inspection_Date)) |>
  mutate(year = year(date))

I converted the inspection date into a proper date format with mdy() and created a year variable so that only the inspection year is kept.
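As a small illustration of the date step, the sketch below parses a few made-up strings (they are assumptions, not values from the dataset) and also counts how many values fail to parse, which is worth checking before relying on the year variable.

```r
library(lubridate)

# Hypothetical sample values; the real column is Inspection_Date
sample_dates <- c("07/15/2024", "1/3/2019", "not a date")

# mdy() expects month/day/year; strings it cannot parse become NA
parsed <- mdy(sample_dates, quiet = TRUE)
inspection_year <- year(parsed)

inspection_year
sum(is.na(parsed))  # number of values that failed to parse
```

Running the same `sum(is.na(...))` check on the real date column would show whether any inspection dates were lost during conversion.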

3. Rename

pool2 <- pool2 |>
  rename(Num_Critical_Violations = `# of Critical Violations`, 
         Num_PHH_Violations = `# of PHH Violations`)

I renamed the violation variables to make the column names easier to read in the dataset and visualizations.

4. Select important variables

pool_inspection <- pool2 |>
  select(Permit_Type, Facility_Name, year, Inspection_Type, 
         Lat, Long, Borough, Num_PHH_Violations,
         Num_Critical_Violations)

I used select to keep only the variables that are important for my analysis and visualizations.

5. Remove NA values

pool_inspection <- pool_inspection |>
  filter(!is.na(Lat),
         !is.na(Long),
         !is.na(Num_PHH_Violations),
         !is.na(Num_Critical_Violations))

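I removed rows with missing coordinates or violation counts so that the map and the regression use complete records. This reduced the dataset from 5,842 rows to 4,701 rows.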
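One way to see how many rows a missing-value filter removes is to compare row counts before and after. The sketch below uses a small made-up tibble (the values are assumptions; only the column names match the report) and tidyr's drop_na(), which should behave like the filter(!is.na(...)) call above for these columns.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data with the same column names as pool_inspection
toy <- tibble(
  Lat = c(40.7, NA, 40.6, 40.8),
  Long = c(-73.9, -74.0, NA, -73.8),
  Num_PHH_Violations = c(0, 1, 2, NA),
  Num_Critical_Violations = c(1, 0, 0, 0)
)

# Drop any row that is missing one of the listed columns
toy_clean <- toy |>
  drop_na(Lat, Long, Num_PHH_Violations, Num_Critical_Violations)

# Rows lost to missing values
n_dropped <- nrow(toy) - nrow(toy_clean)
n_dropped
```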

Most common inspection types

pool_inspection |>
  count(Inspection_Type, sort = TRUE)

# A tibble: 7 × 2
  Inspection_Type             n
  <chr>                   <int>
1 Routine Inspection       3833
2 Re-Inspection             618
3 Complaint Inspection      208
4 Complaint Re-inspection    19
5 Investigation              10
6 Complaint Re-Inspection     9
7 Office Visit                4

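I used count() to find the most common inspection types in the dataset. Routine inspections are by far the most common. "Complaint Re-inspection" and "Complaint Re-Inspection" appear as separate categories only because of a difference in capitalization.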
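The two spellings of the complaint re-inspection label could be merged before counting or modeling. This is only a sketch on made-up rows (the toy tibble is an assumption), using stringr's str_replace() with a fixed pattern:

```r
library(dplyr)
library(stringr)

# Hypothetical toy rows using labels that appear in the count output
toy <- tibble(
  Inspection_Type = c("Complaint Re-inspection",
                      "Complaint Re-Inspection",
                      "Re-Inspection",
                      "Routine Inspection")
)

# Rewrite the lower-case variant so both spellings share one label
toy_fixed <- toy |>
  mutate(Inspection_Type = str_replace(Inspection_Type,
                                       fixed("Re-inspection"),
                                       "Re-Inspection"))

count(toy_fixed, Inspection_Type, sort = TRUE)
```

Applied to the real data, this would combine the 19 and 9 complaint re-inspections into a single category of 28.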

Linear regression

fit1 <- lm(Num_Critical_Violations ~ Num_PHH_Violations + year +
             Borough + Inspection_Type, data = pool_inspection)
summary(fit1)


Call:
lm(formula = Num_Critical_Violations ~ Num_PHH_Violations + year + 
    Borough + Inspection_Type, data = pool_inspection)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.4128 -0.2560 -0.1611 -0.0904  4.6564 

Coefficients:
                                        Estimate Std. Error t value Pr(>|t|)
(Intercept)                            65.458902   8.304222   7.883 3.96e-15
Num_PHH_Violations                      0.239784   0.012950  18.516  < 2e-16
year                                   -0.032316   0.004108  -7.866 4.51e-15
BoroughBrooklyn                        -0.060463   0.033525  -1.804 0.071370
BoroughManhattan                       -0.091569   0.030390  -3.013 0.002600
BoroughQueens                          -0.038043   0.034456  -1.104 0.269610
BoroughStaten Island                   -0.155375   0.040341  -3.852 0.000119
Inspection_TypeComplaint Re-inspection -0.018245   0.130856  -0.139 0.889118
Inspection_TypeComplaint Re-Inspection  0.279584   0.185450   1.508 0.131724
Inspection_TypeInvestigation           -0.077436   0.176359  -0.439 0.660623
Inspection_TypeOffice Visit            -0.088772   0.274816  -0.323 0.746693
Inspection_TypeRe-Inspection            0.130294   0.043743   2.979 0.002910
Inspection_TypeRoutine Inspection       0.135560   0.038828   3.491 0.000485
                                          
(Intercept)                            ***
Num_PHH_Violations                     ***
year                                   ***
BoroughBrooklyn                        .  
BoroughManhattan                       ** 
BoroughQueens                             
BoroughStaten Island                   ***
Inspection_TypeComplaint Re-inspection    
Inspection_TypeComplaint Re-Inspection    
Inspection_TypeInvestigation              
Inspection_TypeOffice Visit               
Inspection_TypeRe-Inspection           ** 
Inspection_TypeRoutine Inspection      ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5441 on 4688 degrees of freedom
Multiple R-squared:  0.08714,   Adjusted R-squared:  0.0848 
F-statistic: 37.29 on 12 and 4688 DF,  p-value: < 2.2e-16

par(mfrow = c(2,2))
plot(fit1)

To understand which factors are related to critical pool violations, I fit a multiple linear regression model using PHH violations, year, borough, and inspection type as predictors. The reference categories are the Bronx for borough and Complaint Inspection for inspection type. PHH violations, year, two boroughs (Manhattan and Staten Island), and two inspection types (Re-Inspection and Routine Inspection) were statistically significant, with p-values below 0.01. Keeping only those terms, the regression equation for predicting critical violations is: Predicted Critical Violations = 65.46 + 0.24(Num_PHH_Violations) - 0.0323(year) - 0.09(Manhattan) - 0.16(Staten Island) + 0.13(Re-Inspection) + 0.14(Routine Inspection). The year coefficient is shown with extra decimals because it is multiplied by a value near 2,000, so rounding it to -0.03 would change the prediction by several violations.

This means that when the number of PHH violations increases by one, the predicted number of critical violations increases by about 0.24, holding the other variables constant. The negative coefficient for year suggests that critical violations decreased slightly over time (about 0.03 fewer per inspection each year), which may reflect improvements in pool safety in recent years.

The adjusted R^2 value was about 0.085. This means the model explains only about 8.5% of the variation in critical violations, so most of the variation comes from factors that are not in the model. In the diagnostic plots, most points followed the general pattern of the model, although some points were farther away from the others.
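To make the equation concrete, here is a worked prediction for one hypothetical inspection (Manhattan, routine, 2024, one PHH violation; the scenario is an assumption chosen for illustration). The coefficients are copied from the summary(fit1) output. The second calculation shows what happens if the year coefficient is rounded to two decimals.

```r
# Coefficients copied from the summary(fit1) output
b <- c(intercept = 65.458902, phh = 0.239784, year = -0.032316,
       manhattan = -0.091569, routine = 0.135560)

# Hypothetical inspection: Manhattan, routine, 2024, one PHH violation
pred <- b[["intercept"]] + b[["phh"]] * 1 + b[["year"]] * 2024 +
  b[["manhattan"]] + b[["routine"]]
round(pred, 2)  # about 0.34 critical violations

# Same scenario with every coefficient rounded to two decimals:
# the small rounding error on year is multiplied by 2024
pred_rounded <- 65.46 + 0.24 * 1 - 0.03 * 2024 - 0.09 + 0.14
round(pred_rounded, 2)  # about 5.03, far from the model's prediction
```

With the fitted model available, predict(fit1, newdata = ...) avoids the rounding problem entirely because it uses the full-precision coefficients.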

Number of inspections by year

pool_inspection |>
  count(year, sort = TRUE)

# A tibble: 6 × 2
   year     n
  <dbl> <int>
1  2019  1266
2  2024  1166
3  2021   746
4  2023   554
5  2022   486
6  2020   483

I ran this chunk to see how many inspections were recorded for each year in the dataset.

Filter

pool_inspection2 <- pool_inspection |>
  filter(Borough %in% c("Manhattan", "Staten Island", "Brooklyn"),
         Inspection_Type %in% c("Routine Inspection", "Re-Inspection"),
         year == 2024)

I filtered the dataset to keep only Brooklyn, Manhattan, and Staten Island, the boroughs with the smallest p-values in the regression (Brooklyn is only marginally significant, with p = 0.07). I also kept only routine inspections and re-inspections because they are the most common inspection types and were significant in the model. I selected 2024 because it is the most recent year in the dataset.

Ridge plot of PHH violations by borough

Final_plot <- ggplot(pool_inspection2,aes(x = Num_PHH_Violations,
           y = Borough,
           fill = Borough)) +
  geom_density_ridges(alpha = 0.9, 
                      scale = 4, 
                      color = "black",
                      bandwidth = 0.3) +
  scale_x_continuous(breaks = 0:5) +
  scale_fill_manual(values = c(
    "Manhattan" = "#bc17f4",
    "Brooklyn" = "#0bc9b7",
    "Staten Island" = "#9aca0b"
  )) +
  labs(
    title = "Distribution of PHH Violations Across NYC Boroughs\nin Pool Inspections (2024)",
    x = "Number of PHH Violations",
    y = "Borough",
    fill = "Borough",
    caption = "Source: NYC Open Data Pool Inspection") +
  theme_minimal(base_size = 12)
Final_plot

This ridge plot shows the distribution of Public Health Hazard (PHH) violations across Brooklyn, Manhattan, and Staten Island in 2024. Most inspections in all three boroughs had few PHH violations, because the highest part of each shape is close to 0. Manhattan has the longest tail to the right, meaning that some inspections there had higher numbers of PHH violations. Brooklyn also extends slightly to the right, meaning a few inspections had higher violation counts too, although not as many as in Manhattan. Staten Island has a more concentrated distribution, which indicates that most inspections had low and similar PHH violation counts.

Map of pool inspections in NYC (2024)
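The visual reading of the ridge plot can be checked with a numeric summary per borough. The sketch below runs on a small made-up tibble (the values are assumptions, not results from the dataset); replacing `toy` with pool_inspection2 would give the actual figures.

```r
library(dplyr)

# Hypothetical toy data with the same column names as pool_inspection2
toy <- tibble(
  Borough = c("Manhattan", "Manhattan", "Manhattan",
              "Brooklyn", "Brooklyn",
              "Staten Island", "Staten Island"),
  Num_PHH_Violations = c(0, 0, 4, 0, 2, 0, 1)
)

phh_summary <- toy |>
  group_by(Borough) |>
  summarise(
    n_inspections = n(),                         # inspections per borough
    share_zero = mean(Num_PHH_Violations == 0),  # fraction with no PHH violations
    max_phh = max(Num_PHH_Violations),           # length of the right tail
    .groups = "drop"
  )
phh_summary
```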

pal <- colorFactor(palette = c("Brooklyn" = "#ca0b0b",
                               "Manhattan" = "#bc17f4",
                               "Staten Island" = "#9aca0b"),
                   domain = pool_inspection2$Borough)

leaflet(pool_inspection2) |>
  addProviderTiles(providers$OpenStreetMap) |>
  setView(lng = -73.94, lat = 40.70, zoom = 11) |>
  addCircleMarkers(
    lng = ~Long,
    lat = ~Lat,
    radius = ~pmax(8, Num_PHH_Violations * 5),
    color = ~pal(Borough),
    fillColor = ~pal(Borough),
    opacity = 1,
    fillOpacity = 0.9,
    stroke = TRUE,
    weight = 1,
    clusterOptions = markerClusterOptions(),
    popup = ~paste0(
      "<b>Facility:</b> ", Facility_Name,
      "<br><b>Borough:</b> ", Borough,
      "<br><b>Permit Type:</b> ", Permit_Type,
      "<br><b>Inspection Type:</b> ", Inspection_Type,
      "<br><b>PHH Violations:</b> ", Num_PHH_Violations,
      "<br><b>Critical Violations:</b> ", Num_Critical_Violations)) |>
  addLegend(
    position = "bottomright",
    pal = pal,
    values = ~Borough,
    title = "Borough")

My map shows the 2024 pool inspections in Brooklyn, Manhattan, and Staten Island. Each circle represents one inspection, so a facility that was inspected more than once appears more than once. Circle size depends on the number of PHH violations: the radius is the larger of 8 and five times the violation count, so only inspections with two or more PHH violations get a larger circle. The colors represent the boroughs. Because the markers are clustered, nearby inspections are grouped together when the map is zoomed out, especially in parts of Brooklyn and Manhattan, which suggests that several inspected facilities are located in the same areas. Clicking on a circle shows additional information such as the facility name, permit type, inspection type, and number of violations.
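The marker size comes from the expression pmax(8, Num_PHH_Violations * 5) in the map code. Evaluating it for a few violation counts (the counts 0 to 4 are just example inputs) shows which inspections actually get a larger circle:

```r
# Example PHH violation counts
phh_counts <- 0:4

# Same radius rule as in addCircleMarkers(): at least 8, otherwise 5 per violation
radius <- pmax(8, phh_counts * 5)

data.frame(phh_counts, radius)
# radius is 8, 8, 10, 15, 20
```

Inspections with zero or one PHH violation share the minimum radius of 8, so the circle only grows from two violations onward.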

While working on this project, I learned more about how pool inspections help protect public health and safety in NYC. Before this project, I did not know that pools were regularly inspected, so it was interesting to explore how inspections are conducted and how violations are recorded.

Background Information

Pool inspections in New York City are conducted by the NYC Department of Health and Mental Hygiene (DOHMH) to make sure public pools follow health and safety regulations. Inspectors check different conditions such as water quality, chlorine and pH levels, cleanliness, safety equipment, warning signs, and possible health hazards. Public pools must also follow rules related to sanitation, emergency equipment, and safe maintenance practices. According to the NYC Open Data Pool Inspection dataset, inspections are completed for both indoor and outdoor pools across New York City. These inspections are important because they help protect swimmers from injuries, unsafe conditions, and water-related illnesses.

In addition, pools in New York City are required to have permits and certified pool operators who are trained in pool safety and water treatment. Health officials regularly monitor pools to make sure they remain clean and safe for the public.

Sources

“Bathing Establishment with Pool Permit Information.” NYC Business, City of New York, https://www.nyc.gov/site/doh/business/permits-and-licenses/pools-spas-and-spray-grounds.page

“NYC Pool Inspections.” NYC Open Data, City of New York, https://data.cityofnewyork.us/Health/NYC-Pool-Inspections/3kfa-rvez/about_data

“Swimming Pools, Spas, and Bathing Beaches.” American Legal Publishing, New York City Rules, https://codelibrary.amlegal.com/codes/newyorkcity/latest/NYCrules/0-0-0-47432

“Tips and Tricks.” An Introduction to R, https://intro2r.com/

Zhang, Zhenguo. “Tutorial: How to Create and Tune Ridgeline Density Plot Using Ggridges.” R-bloggers, 30 June 2023, https://www.r-bloggers.com/2023/06/tutorial-how-to-create-and-tune-ridgeline-density-plot-using-ggridges/