My project focuses on pool inspections across the boroughs of New York City, using the NYC Open Data Pool Inspection dataset. The dataset includes each pool's name and location, community board, Council district, census tract, borough, permit type, inspection type, PHH violations, critical violations, general violations, and geographic coordinates.
To prepare the data, I changed borough abbreviations into full borough names and converted the inspection date into a proper date format. I created and renamed some variables, then selected only the variables that were important for my analysis and visualizations. I also filtered the rows to keep only the boroughs, inspection types, and year relevant to my visualizations.
I chose this topic because I did not know that pools were regularly inspected in NYC, so I became curious about how pool inspections work and why they are important. I also wanted to learn more about the types of violations found during inspections and how they can affect public safety and health.
Variables:
| Variable Type | Variable Name | Description |
|---|---|---|
| Categorical | Borough | Name of NYC borough where the pool inspection took place |
| Categorical | Permit_type | Type of permit for the pool facility (indoor or outdoor) |
| Categorical | Inspection_type | Type of inspection performed |
| Categorical | Facility_name | Name of the pool facility being inspected |
| Categorical | Year | Year when the inspection took place |
| Numerical | # of PHH Violations | Number of Public Health Hazard violations found during inspection |
| Numerical | # of critical violations | Number of critical violations found during inspection |
Load libraries
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.2.0 ✔ readr 2.2.0
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.2 ✔ tibble 3.3.1
✔ lubridate 1.9.5 ✔ tidyr 1.3.2
✔ purrr 1.2.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(leaflet)
library(ggridges)
Read data
setwd("~/Downloads/First data 110 assignment_files")
pool <- read_csv("NYC_Pool_Inspections.csv")
Rows: 5842 Columns: 23
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (10): Permit_Type, Facility_Name, ADDRESS_ No, ADDRESS_St, BO, Inspectio...
dbl (13): ACCELA, ZIP, # of All Violations, # of PHH Violations, # of Critic...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
CLEANING
1. Use borough full name
pool <- pool |>
  mutate(
    Borough = case_when(
      BO == "MA" ~ "Manhattan",
      BO == "BX" ~ "Bronx",
      BO == "BK" ~ "Brooklyn",
      BO == "QU" ~ "Queens",
      BO == "SI" ~ "Staten Island",
      TRUE ~ BO  # keep any other code unchanged
    )
  )
I used mutate() with case_when() to change the borough abbreviations into full borough names, which makes the dataset easier to read and understand.
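One way to confirm that the recoding caught every abbreviation is to cross-tabulate the old and new columns. The sketch below uses a small made-up tibble rather than the real data, and the code "XX" is a hypothetical unexpected value added only to show what the fallback branch does.

```r
library(dplyr)

# Toy data: "XX" is a made-up code standing in for an unexpected value
toy <- tibble(BO = c("MA", "BX", "XX"))

toy_recoded <- toy |>
  mutate(
    Borough = case_when(
      BO == "MA" ~ "Manhattan",
      BO == "BX" ~ "Bronx",
      BO == "BK" ~ "Brooklyn",
      BO == "QU" ~ "Queens",
      BO == "SI" ~ "Staten Island",
      TRUE ~ BO  # unmatched codes fall through unchanged
    )
  )

# Any row where BO and Borough are identical was not recoded
recode_check <- toy_recoded |> count(BO, Borough)
recode_check
```

Running the same count(BO, Borough) on the real data would show at a glance whether any abbreviation was left as-is.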
2. Change date to year
pool2 <- pool |>
  mutate(date = mdy(Inspection_Date)) |>
  mutate(year = year(date))
I converted the inspection date into a proper date format and created a year variable that keeps only the inspection year.
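As a small illustration of this step, the sketch below parses two made-up date strings written in month/day/year order and extracts the year. The strings are examples only and are not taken from the dataset.

```r
library(lubridate)

# Example strings in month/day/year order (made up for illustration)
raw_dates <- c("07/04/2024", "12/31/2019")

# mdy() turns the text into Date values
parsed <- mdy(raw_dates)

# year() pulls the calendar year out of each Date
inspection_year <- year(parsed)
inspection_year
```

If the raw column used a different order, such as day/month/year, a different parser like dmy() would be needed instead.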
3. Rename
pool2 <- pool2 |>
  rename(
    Num_Critical_Violations = `# of Critical Violations`,
    Num_PHH_Violations = `# of PHH Violations`
  )
I renamed the violation variables to make the column names easier to read in the dataset and visualizations.
To understand what factors are related to critical pool violations, I created a multiple linear regression model using PHH violations, year, borough, and inspection type. After looking at the regression results, I found that PHH violations, year, and some inspection types were important variables because they had small p-values. The regression equation for predicting critical violations is:
Predicted Critical Violations = 65.46 + 0.23(Num_PHH_Violations) - 0.03(year) - 0.09(Manhattan) - 0.16(Staten Island) + 0.13(Re-Inspection) + 0.14(Routine Inspection)
In this equation, Manhattan, Staten Island, Re-Inspection, and Routine Inspection are indicator variables that equal 1 when the inspection falls in that category and 0 otherwise.
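The modeling code is not shown in this report, so the following is only a sketch of how a model with these predictors could be fit with lm(). It runs on simulated data, not the real inspections, and the column name Inspection_Type, the seed, and the simulated values are all assumptions made for illustration.

```r
set.seed(110)  # arbitrary seed so the simulated data is reproducible

n <- 200
# Simulated stand-in for the inspection data; these values are NOT real
sim <- data.frame(
  Num_PHH_Violations = rpois(n, lambda = 0.5),
  year = sample(2019:2024, n, replace = TRUE),
  Borough = sample(c("Brooklyn", "Manhattan", "Staten Island"), n, replace = TRUE),
  Inspection_Type = sample(c("Re-Inspection", "Routine Inspection"), n, replace = TRUE)
)
sim$Num_Critical_Violations <- rpois(n, lambda = 0.3 + 0.2 * sim$Num_PHH_Violations)

# Multiple linear regression; character predictors are treated as categories
model <- lm(
  Num_Critical_Violations ~ Num_PHH_Violations + year + Borough + Inspection_Type,
  data = sim
)

model_summary <- summary(model)
coef(model_summary)          # estimates, standard errors, t values, p-values
model_summary$adj.r.squared  # adjusted R-squared
```

Calling plot(model) afterwards draws the standard diagnostic plots. Because the data here is simulated, the numbers will not match the coefficients reported above.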
This means that when the number of PHH violations increases by one, the predicted number of critical violations increases by about 0.23 while the other variables stay the same. The negative value for year suggests that critical violations slightly decreased over time, which may show improvements in pool safety in recent years.
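To make this interpretation concrete, the sketch below plugs values into the reported equation. The function name predict_critical and its argument names are hypothetical. Because the year coefficient is rounded and gets multiplied by a number near 2000, the absolute predictions from the rounded equation are not reliable, so the example compares predictions instead of reporting them.

```r
# Rounded coefficients copied from the reported regression equation
predict_critical <- function(phh, year, manhattan = 0, staten_island = 0,
                             re_inspection = 0, routine = 0) {
  65.46 + 0.23 * phh - 0.03 * year -
    0.09 * manhattan - 0.16 * staten_island +
    0.13 * re_inspection + 0.14 * routine
}

# Same routine inspection in 2024, with one versus two PHH violations
one_phh <- predict_critical(phh = 1, year = 2024, routine = 1)
two_phh <- predict_critical(phh = 2, year = 2024, routine = 1)
phh_effect <- two_phh - one_phh

# Same inspection one year apart
year_effect <- predict_critical(phh = 1, year = 2024, routine = 1) -
  predict_critical(phh = 1, year = 2023, routine = 1)

c(phh_effect = phh_effect, year_effect = year_effect)
```

The two differences come out to about 0.23 and -0.03, which are the PHH and year coefficients.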
The adjusted R^2 value was about 0.085. This means the model explains only about 8.5% of the variation in critical violations, so most of the variation is not explained by these predictors. When looking at the diagnostic plots, most points followed the general pattern of the model, although some points were farther away from the others.
Number of inspections by year
pool_inspection |>
  count(year, sort = TRUE)
# A tibble: 6 × 2
year n
<dbl> <int>
1 2019 1266
2 2024 1166
3 2021 746
4 2023 554
5 2022 486
6 2020 483
I ran this chunk to see how many inspections were recorded for each year in the dataset.
I filtered the dataset to keep only Brooklyn, Manhattan, and Staten Island based on the linear regression analysis. I also kept only routine inspections and re-inspections because they are the most relevant inspection types. I selected the year 2024 because it is the most recent year in the dataset.
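The filtering code is not shown in this report, so here is a minimal sketch of how such a filter could look with dplyr. It uses a made-up tibble, and the column name Inspection_Type and the label "Complaint" are assumptions for illustration. The labels "Routine Inspection" and "Re-Inspection" follow the wording of the regression equation.

```r
library(dplyr)

# Toy rows; column names and labels are assumptions, not the real data
toy_inspections <- tibble(
  Borough = c("Brooklyn", "Queens", "Manhattan", "Staten Island", "Bronx"),
  Inspection_Type = c("Routine Inspection", "Routine Inspection",
                      "Re-Inspection", "Complaint", "Re-Inspection"),
  year = c(2024, 2024, 2024, 2024, 2023)
)

# A row is kept only if it passes all three conditions
toy_filtered <- toy_inspections |>
  filter(
    Borough %in% c("Brooklyn", "Manhattan", "Staten Island"),
    Inspection_Type %in% c("Routine Inspection", "Re-Inspection"),
    year == 2024
  )
toy_filtered
```

Only the Brooklyn and Manhattan rows remain, since the other rows fail at least one of the three conditions.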
Ridge plot of PHH violations by borough
Final_plot <- ggplot(pool_inspection2, aes(x = Num_PHH_Violations, y = Borough, fill = Borough)) +
  geom_density_ridges(alpha = 0.9, scale = 4, color = "black", bandwidth = 0.3) +
  scale_x_continuous(breaks = 0:5) +
  scale_fill_manual(values = c(
    "Manhattan" = "#bc17f4",
    "Brooklyn" = "#0bc9b7",
    "Staten Island" = "#9aca0b"
  )) +
  labs(
    title = "Distribution of PHH Violations Across NYC Boroughs in Pool Inspections (2024)",
    x = "Number of PHH Violations",
    y = "Borough",
    fill = "Borough",
    caption = "Source: NYC Open Data Pool Inspection"
  ) +
  theme_minimal(base_size = 12)
Final_plot
This ridge plot shows the distribution of Public Health Hazard (PHH) violations across Brooklyn, Manhattan, and Staten Island in 2024. Most inspections in all three boroughs had few PHH violations, because the highest part of each shape is close to 0. Manhattan has the longest shape going to the right, meaning that some inspections had higher numbers of PHH violations. Brooklyn also extends slightly to the right, meaning a few inspections had higher violation counts too, although not as many as in Manhattan. Staten Island has a more concentrated distribution, which indicates that most inspections had low and similar numbers of PHH violations.
My map shows the pool inspections in selected NYC boroughs in 2024. Each circle represents a pool facility, and larger circles indicate locations with violations. The colors represent the boroughs. The map also shows that some inspections are grouped close to each other, especially in parts of Brooklyn and Manhattan, which suggests that several inspected facilities are located in the same areas. By clicking on the circles, we can view additional information such as the facility name, permit type, inspection type, and number of violations.
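The map code is not shown in this report, so the following is only a sketch of how a map like this could be built with leaflet. The facilities, coordinates, violation counts, column names, and the radius formula are all made up for illustration. The colors are meant to echo the ones used in the ridge plot.

```r
library(leaflet)

# Toy facilities; names, coordinates, and counts are made up
toy_pools <- data.frame(
  Facility_Name = c("Pool A", "Pool B", "Pool C"),
  Borough = c("Brooklyn", "Manhattan", "Staten Island"),
  Latitude = c(40.650, 40.780, 40.580),
  Longitude = c(-73.950, -73.970, -74.150),
  Num_Violations = c(0, 3, 1)
)

# One color per borough
borough_pal <- colorFactor(
  palette = c("#0bc9b7", "#bc17f4", "#9aca0b"),
  domain = c("Brooklyn", "Manhattan", "Staten Island")
)

pool_map <- leaflet(toy_pools) |>
  addTiles() |>
  addCircleMarkers(
    lng = ~Longitude,
    lat = ~Latitude,
    radius = ~4 + 3 * Num_Violations,  # assumed scaling: more violations, larger circle
    color = ~borough_pal(Borough),
    popup = ~paste0(Facility_Name, "<br>Violations: ", Num_Violations)
  ) |>
  addLegend(pal = borough_pal, values = ~Borough, title = "Borough")

pool_map
```

The popup text is what appears when a circle is clicked, so more fields such as permit type or inspection type could be pasted into it in the same way.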
While working on this project, I learned more about how pool inspections help protect public health and safety in NYC. Before this project, I did not know that pools were regularly inspected, so it was interesting to explore how inspections are conducted and how violations are recorded.
Background Information
Pool inspections in New York City are conducted by the NYC Department of Health and Mental Hygiene (DOHMH) to make sure public pools follow health and safety regulations. Inspectors check different conditions such as water quality, chlorine and pH levels, cleanliness, safety equipment, warning signs, and possible health hazards. Public pools must also follow rules related to sanitation, emergency equipment, and safe maintenance practices. According to the NYC Open Data Pool Inspection dataset, inspections are completed for both indoor and outdoor pools across New York City. These inspections are important because they help protect swimmers from injuries, unsafe conditions, and water-related illnesses.
In addition, pools in New York City are required to have permits and certified pool operators who are trained in pool safety and water treatment. Health officials regularly monitor pools to make sure they remain clean and safe for the public.