Water quality in recreational swimming areas is an important environmental and public health concern. Contaminated water can expose swimmers to harmful microorganisms that may cause gastrointestinal illness, skin infections, and other health problems. Because of these risks, environmental agencies routinely monitor bacterial levels in beaches, lakes, rivers, and other recreational waters.
One commonly used bacterial indicator is enterococci bacteria. Enterococci are naturally found in the intestines of humans and animals, so elevated levels in water often suggest fecal contamination from sewage, stormwater runoff, wildlife waste, or failing wastewater systems. Since enterococci levels are strongly associated with swimmer illness, they are commonly used to determine whether water is safe for recreation.
Rainfall may strongly affect bacteria levels in water. During rain events, runoff can wash pet waste, sediments, sewage overflow, and other contaminants into nearby waterways. Heavy rainfall may therefore increase bacterial concentrations and reduce water quality.
The data used in this project combine water quality measurements with weather observations collected on matching dates. The primary response variable (dependent variable) is enterococci concentration, measured in colony-forming units per 100 milliliters (CFU/100 mL). The main explanatory variable (independent variable) is rainfall amount, measured in millimeters (mm) of precipitation.
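Matching on date only works cleanly if each date appears once in the weather table; otherwise a join silently duplicates water samples. The sketch below illustrates that check on toy data. The column names follow this report, but the `_toy` data frames and their values are invented for illustration, and base R `merge()` stands in for the join so the example runs without extra packages.

```r
# Toy stand-ins for the two tables; values are made up for illustration.
water_quality_toy <- data.frame(
  swim_site = c("Site A", "Site B", "Site A"),
  date = as.Date(c("2024-01-01", "2024-01-01", "2024-01-02")),
  enterococci_cfu_100ml = c(4, 120, 2)
)
weather_toy <- data.frame(
  date = as.Date(c("2024-01-01", "2024-01-02")),
  precipitation_mm = c(12.5, 0)
)

# A date that appears more than once in the weather table would
# duplicate water samples after the join, so check the key first.
stopifnot(anyDuplicated(weather_toy$date) == 0)

# Left join: keep every water sample and attach that day's rainfall.
joined_toy <- merge(water_quality_toy, weather_toy, by = "date", all.x = TRUE)

# The row count should match the water quality table exactly.
nrow(joined_toy) == nrow(water_quality_toy)
```

The same two checks (unique key, unchanged row count) could be run on the real tables before and after the join.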
# Load packages used throughout the analysis
library(tidyverse)
library(knitr)

# Import datasets
water_quality <- readr::read_csv(
'https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2025/2025-05-20/water_quality.csv'
)
weather <- readr::read_csv(
'https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2025/2025-05-20/weather.csv'
)
# Include map of sampling locations
knitr::include_graphics("~/Desktop/BIO 320 - Lister/Final Project/BeachwatchMap.png")
Does the level of rainfall affect the bacteria levels in the water?
# Merge datasets by date
combined_data <- water_quality %>%
left_join(weather, by = "date") %>%
select(swim_site, date, enterococci_cfu_100ml,
water_temperature_c, precipitation_mm) %>%
drop_na(enterococci_cfu_100ml, precipitation_mm)
# Create new variables
combined_data <- combined_data %>%
mutate(
rainy_day = if_else(precipitation_mm > 0, "Rainy", "Non-rainy"),
rainy_day = factor(rainy_day),
log_enterococci = log10(enterococci_cfu_100ml + 1)
)
# Overall summary statistics
summary_stats <- combined_data %>%
summarise(
n = n(),
mean_enterococci = mean(enterococci_cfu_100ml),
median_enterococci = median(enterococci_cfu_100ml),
sd_enterococci = sd(enterococci_cfu_100ml),
se_enterococci = sd(enterococci_cfu_100ml) / sqrt(n()),
mean_rain = mean(precipitation_mm),
median_rain = median(precipitation_mm),
sd_rain = sd(precipitation_mm),
se_rain = sd(precipitation_mm) / sqrt(n())
)
kable(summary_stats, digits = 2)
| n | mean_enterococci | median_enterococci | sd_enterococci | se_enterococci | mean_rain | median_rain | sd_rain | se_rain |
|---|---|---|---|---|---|---|---|---|
| 123223 | 116.77 | 4 | 4714.63 | 13.43 | 1.94 | 0.1 | 5.61 | 0.02 |
# Grouped summary statistics
group_stats <- combined_data %>%
group_by(rainy_day) %>%
summarise(
n = n(),
mean_enterococci = mean(enterococci_cfu_100ml),
median_enterococci = median(enterococci_cfu_100ml),
sd_enterococci = sd(enterococci_cfu_100ml),
se_enterococci = sd(enterococci_cfu_100ml) / sqrt(n())
)
kable(group_stats, digits = 2)
| rainy_day | n | mean_enterococci | median_enterococci | sd_enterococci | se_enterococci |
|---|---|---|---|---|---|
| Non-rainy | 60057 | 53.69 | 2 | 1462.13 | 5.97 |
| Rainy | 63166 | 176.75 | 6 | 6428.20 | 25.58 |
# Histogram of Enterococci Levels
ggplot(combined_data, aes(x = enterococci_cfu_100ml)) +
geom_histogram(fill = "steelblue", bins = 30) +
theme_minimal() +
labs(
title = "Distribution of Enterococci Bacteria Levels",
x = "Enterococci (CFU per 100 mL)",
y = "Frequency"
)
# Histogram of Log-Transformed Data
ggplot(combined_data, aes(x = log_enterococci)) +
geom_histogram(fill = "darkgreen", bins = 30) +
theme_minimal() +
labs(
title = "Distribution of Log-Transformed Enterococci Levels",
x = "Log10 Enterococci",
y = "Frequency"
)
# Boxplot: Rainy vs Non-Rainy Days
ggplot(combined_data, aes(x = rainy_day, y = log_enterococci, fill = rainy_day)) +
geom_boxplot() +
theme_minimal() +
labs(
title = "Bacteria Levels on Rainy vs Non-Rainy Days",
x = "Day Type",
y = "Log10 Enterococci"
)
# Scatterplot: Rainfall vs Bacteria
ggplot(combined_data, aes(x = precipitation_mm, y = log_enterococci)) +
geom_point(alpha = 0.6) +
geom_smooth(method = "lm", color = "red", se = TRUE) +
theme_minimal() +
labs(
title = "Rainfall vs Bacteria Levels",
x = "Rainfall (mm)",
y = "Log10 Enterococci"
)
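The raw enterococci data were strongly right-skewed rather than normally distributed (mean 116.77 versus median 4 CFU/100 mL), so a log10(x + 1) transformation was applied before statistical testing. Adding 1 keeps any zero counts defined on the log scale.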
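To illustrate why the transformation helps, here is a minimal sketch on simulated data. The log-normal parameters are arbitrary assumptions chosen only to produce a long right tail; these are not the real measurements.

```r
set.seed(1)
# Simulated right-skewed counts (log-normal), NOT the real measurements.
cfu_sim <- round(rlnorm(10000, meanlog = 1.5, sdlog = 2))

# Raw scale: a few extreme values pull the mean far above the median.
c(mean = mean(cfu_sim), median = median(cfu_sim))

# log10(x + 1) keeps zero counts defined (log10(0) is -Inf) and
# compresses the long right tail.
log_sim <- log10(cfu_sim + 1)
c(mean = mean(log_sim), median = median(log_sim))
```

On the log scale the mean and median sit much closer together, which is the pattern the two histograms in this report are meant to show.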
Independent-samples t-test (Welch): this test compares mean log-transformed bacteria levels between rainy and non-rainy days.
- Null hypothesis: Mean log-transformed bacteria levels are equal on rainy and non-rainy days.
- Alternative hypothesis: Mean log-transformed bacteria levels differ between rainy and non-rainy days.
Linear regression: this model tests whether rainfall amount predicts log-transformed bacteria levels.
- Null hypothesis: Rainfall amount has no linear relationship with bacteria levels (slope equal to 0).
- Alternative hypothesis: Rainfall amount has a linear relationship with bacteria levels (slope not equal to 0).
Multiple Regression (Improved Model): this model controls for water temperature and swim site.
Diagnostic plots for the simple regression model were used to assess linearity, normality of residuals, constant variance, and influential outliers.
# Independent Sample t-test
t_test_results <- t.test(log_enterococci ~ rainy_day, data = combined_data)
t_test_results
##
## Welch Two Sample t-test
##
## data: log_enterococci by rainy_day
## t = -61.899, df = 121492, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Non-rainy and group Rainy is not equal to 0
## 95 percent confidence interval:
## -0.2742393 -0.2574052
## sample estimates:
## mean in group Non-rainy mean in group Rainy
## 0.7014733 0.9672956
# Linear Regression
rain_model <- lm(log_enterococci ~ precipitation_mm, data = combined_data)
summary(rain_model)
##
## Call:
## lm(formula = log_enterococci ~ precipitation_mm, data = combined_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2599 -0.7903 -0.0962 0.4601 4.9433
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.7902765 0.0022793 346.72 <2e-16 ***
## precipitation_mm 0.0244279 0.0003842 63.59 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.756 on 123221 degrees of freedom
## Multiple R-squared: 0.03177, Adjusted R-squared: 0.03176
## F-statistic: 4043 on 1 and 123221 DF, p-value: < 2.2e-16
# Multiple Regression
multi_model <- lm(
log_enterococci ~ precipitation_mm + water_temperature_c + swim_site,
data = combined_data
)
summary(multi_model)
##
## Call:
## lm(formula = log_enterococci ~ precipitation_mm + water_temperature_c +
## swim_site, data = combined_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.4039 -0.4391 -0.1212 0.3576 4.6005
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 0.2055298 0.0257566 7.980
## precipitation_mm 0.0203184 0.0004795 42.377
## water_temperature_c 0.0010228 0.0002097 4.877
## swim_siteBalmoral Baths 0.5432819 0.0382200 14.215
## swim_siteBilarong Reserve 1.0298891 0.0361783 28.467
## swim_siteBilgola Beach 0.1109708 0.0360188 3.081
## swim_siteBoat Harbour 0.8205223 0.0354658 23.136
## swim_siteBondi Beach 0.5801112 0.0338422 17.142
## swim_siteBronte Beach 0.6609371 0.0331994 19.908
## swim_siteBungan Beach 0.0521869 0.0359199 1.453
## swim_siteCabarita Beach 0.5797650 0.0316765 18.303
## swim_siteCallan Park Seawall 0.6770511 0.0478790 14.141
## swim_siteCamp Cove 0.1985926 0.0420930 4.718
## swim_siteChinamans Beach 0.3085838 0.0318677 9.683
## swim_siteChiswick Baths 0.5712400 0.0401415 14.231
## swim_siteClifton Gardens 0.4590599 0.0383698 11.964
## swim_siteClontarf Pool 0.5568633 0.0386553 14.406
## swim_siteClovelly Beach 0.4459010 0.0339412 13.137
## swim_siteCollaroy Beach 0.3795577 0.0360046 10.542
## swim_siteCoogee Beach 0.8508519 0.0331617 25.658
## swim_siteDarling Harbour 1.2447456 0.0392951 31.677
## swim_siteDavidson Reserve 0.8700922 0.0380555 22.864
## swim_siteDawn Fraser Pool 0.6343674 0.0380129 16.688
## swim_siteDee Why Beach 0.2819881 0.0359905 7.835
## swim_siteEdwards Beach 0.3578920 0.0390996 9.153
## swim_siteElouera Beach 0.2425258 0.0353390 6.863
## swim_siteFairlight Beach 0.3283249 0.0394707 8.318
## swim_siteForty Baskets Pool 0.4440417 0.0389556 11.399
## swim_siteFreshwater Beach 0.4921560 0.0360047 13.669
## swim_siteGordons Bay (East) 0.3961165 0.0353392 11.209
## swim_siteGreenhills Beach 0.1750181 0.0355296 4.926
## swim_siteGreenwich Baths 0.6186080 0.0379926 16.282
## swim_siteGurney Crescent Baths 0.5178153 0.0396527 13.059
## swim_siteHayes Street Beach 0.7866814 0.0389560 20.194
## swim_siteHenley Baths (Kelly Street Baths) 0.8449548 0.1073281 7.873
## swim_siteLittle Bay Beach 0.7113614 0.0340523 20.890
## swim_siteLittle Manly Cove 0.5366109 0.0384122 13.970
## swim_siteLittle Sirius Cove 1.3246587 0.0612569 21.625
## swim_siteLong Reef Beach 0.1766427 0.0314278 5.621
## swim_siteMalabar Beach 0.9119533 0.0337273 27.039
## swim_siteManly Cove 0.4678243 0.0315187 14.843
## swim_siteMaroubra Beach 0.4107633 0.0313343 13.109
## swim_siteMegalong Creek 1.7648720 0.0646351 27.305
## swim_siteMona Vale Beach 0.1490143 0.0313784 4.749
## swim_siteMurray Rose Pool 0.7422783 0.0381365 19.464
## swim_siteNarrabeen Lagoon (Birdwood Park) 1.0449724 0.0364190 28.693
## swim_siteNewport Beach 0.1718602 0.0359059 4.786
## swim_siteNielsen Park 0.2176661 0.0316517 6.877
## swim_siteNorth Cronulla Beach 0.2788387 0.0350723 7.950
## swim_siteNorth Curl Curl Beach 0.3345735 0.0349907 9.562
## swim_siteNorth Narrabeen Beach 0.2255751 0.0360474 6.258
## swim_siteNorth Steyne Beach 0.4495448 0.0360190 12.481
## swim_siteNorthbridge Baths 0.6648958 0.0385006 17.270
## swim_siteOak Park Beach 0.1779708 0.0353018 5.041
## swim_sitePalm Beach 0.1042418 0.0313256 3.328
## swim_siteParsley Bay 0.6588720 0.0382411 17.229
## swim_sitePenrith Beach 0.7177563 0.0991408 7.240
## swim_siteQueenscliff Beach 0.6885395 0.0348990 19.729
## swim_siteRose Bay Beach 0.8678935 0.0386326 22.465
## swim_siteSangrado Baths 1.2623801 0.1182144 10.679
## swim_siteShelly Beach (Manly) 0.6201551 0.0313744 19.766
## swim_siteShelly Beach (Sutherland) 0.2209829 0.0352648 6.266
## swim_siteSouth Cronulla Beach 0.4556274 0.0352648 12.920
## swim_siteSouth Curl Curl Beach 0.1972558 0.0360332 5.474
## swim_siteSouth Maroubra Beach 0.4030818 0.0350374 11.504
## swim_siteSouth Maroubra Rockpool 0.7170564 0.0353392 20.291
## swim_siteSouth Steyne Beach 0.7079300 0.0352772 20.068
## swim_siteTamarama Beach 0.5705754 0.0337188 16.922
## swim_siteTambourine Bay 0.8881127 0.0382210 23.236
## swim_siteTurimetta Beach 0.0916142 0.0361197 2.536
## swim_siteWanda Beach 0.2122272 0.0352893 6.014
## swim_siteWarriewood Beach 0.1423594 0.0360762 3.946
## swim_siteWatsons Bay 0.4968795 0.0384339 12.928
## swim_siteWentworth Falls Lake - Beach 1.3006895 0.0630730 20.622
## swim_siteWentworth Falls Lake - Jetty 1.6645338 0.0632875 26.301
## swim_siteWhale Beach -0.0134556 0.0359199 -0.375
## swim_siteWindsor Beach 1.3817573 0.0635073 21.757
## swim_siteWoodford Bay 0.6837651 0.0388387 17.605
## swim_siteWoolwich Baths 0.8216067 0.0379131 21.671
## swim_siteYarramundi Reserve 1.6777750 0.0641735 26.144
## swim_siteYosemite Creek - Minnehaha Falls 1.6468562 0.0682587 24.127
## Pr(>|t|)
## (Intercept) 1.50e-15 ***
## precipitation_mm < 2e-16 ***
## water_temperature_c 1.08e-06 ***
## swim_siteBalmoral Baths < 2e-16 ***
## swim_siteBilarong Reserve < 2e-16 ***
## swim_siteBilgola Beach 0.002065 **
## swim_siteBoat Harbour < 2e-16 ***
## swim_siteBondi Beach < 2e-16 ***
## swim_siteBronte Beach < 2e-16 ***
## swim_siteBungan Beach 0.146266
## swim_siteCabarita Beach < 2e-16 ***
## swim_siteCallan Park Seawall < 2e-16 ***
## swim_siteCamp Cove 2.39e-06 ***
## swim_siteChinamans Beach < 2e-16 ***
## swim_siteChiswick Baths < 2e-16 ***
## swim_siteClifton Gardens < 2e-16 ***
## swim_siteClontarf Pool < 2e-16 ***
## swim_siteClovelly Beach < 2e-16 ***
## swim_siteCollaroy Beach < 2e-16 ***
## swim_siteCoogee Beach < 2e-16 ***
## swim_siteDarling Harbour < 2e-16 ***
## swim_siteDavidson Reserve < 2e-16 ***
## swim_siteDawn Fraser Pool < 2e-16 ***
## swim_siteDee Why Beach 4.78e-15 ***
## swim_siteEdwards Beach < 2e-16 ***
## swim_siteElouera Beach 6.83e-12 ***
## swim_siteFairlight Beach < 2e-16 ***
## swim_siteForty Baskets Pool < 2e-16 ***
## swim_siteFreshwater Beach < 2e-16 ***
## swim_siteGordons Bay (East) < 2e-16 ***
## swim_siteGreenhills Beach 8.42e-07 ***
## swim_siteGreenwich Baths < 2e-16 ***
## swim_siteGurney Crescent Baths < 2e-16 ***
## swim_siteHayes Street Beach < 2e-16 ***
## swim_siteHenley Baths (Kelly Street Baths) 3.54e-15 ***
## swim_siteLittle Bay Beach < 2e-16 ***
## swim_siteLittle Manly Cove < 2e-16 ***
## swim_siteLittle Sirius Cove < 2e-16 ***
## swim_siteLong Reef Beach 1.91e-08 ***
## swim_siteMalabar Beach < 2e-16 ***
## swim_siteManly Cove < 2e-16 ***
## swim_siteMaroubra Beach < 2e-16 ***
## swim_siteMegalong Creek < 2e-16 ***
## swim_siteMona Vale Beach 2.05e-06 ***
## swim_siteMurray Rose Pool < 2e-16 ***
## swim_siteNarrabeen Lagoon (Birdwood Park) < 2e-16 ***
## swim_siteNewport Beach 1.70e-06 ***
## swim_siteNielsen Park 6.19e-12 ***
## swim_siteNorth Cronulla Beach 1.90e-15 ***
## swim_siteNorth Curl Curl Beach < 2e-16 ***
## swim_siteNorth Narrabeen Beach 3.94e-10 ***
## swim_siteNorth Steyne Beach < 2e-16 ***
## swim_siteNorthbridge Baths < 2e-16 ***
## swim_siteOak Park Beach 4.64e-07 ***
## swim_sitePalm Beach 0.000876 ***
## swim_siteParsley Bay < 2e-16 ***
## swim_sitePenrith Beach 4.56e-13 ***
## swim_siteQueenscliff Beach < 2e-16 ***
## swim_siteRose Bay Beach < 2e-16 ***
## swim_siteSangrado Baths < 2e-16 ***
## swim_siteShelly Beach (Manly) < 2e-16 ***
## swim_siteShelly Beach (Sutherland) 3.73e-10 ***
## swim_siteSouth Cronulla Beach < 2e-16 ***
## swim_siteSouth Curl Curl Beach 4.42e-08 ***
## swim_siteSouth Maroubra Beach < 2e-16 ***
## swim_siteSouth Maroubra Rockpool < 2e-16 ***
## swim_siteSouth Steyne Beach < 2e-16 ***
## swim_siteTamarama Beach < 2e-16 ***
## swim_siteTambourine Bay < 2e-16 ***
## swim_siteTurimetta Beach 0.011203 *
## swim_siteWanda Beach 1.82e-09 ***
## swim_siteWarriewood Beach 7.96e-05 ***
## swim_siteWatsons Bay < 2e-16 ***
## swim_siteWentworth Falls Lake - Beach < 2e-16 ***
## swim_siteWentworth Falls Lake - Jetty < 2e-16 ***
## swim_siteWhale Beach 0.707959
## swim_siteWindsor Beach < 2e-16 ***
## swim_siteWoodford Bay < 2e-16 ***
## swim_siteWoolwich Baths < 2e-16 ***
## swim_siteYarramundi Reserve < 2e-16 ***
## swim_siteYosemite Creek - Minnehaha Falls < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6428 on 48103 degrees of freedom
## (75039 observations deleted due to missingness)
## Multiple R-squared: 0.2114, Adjusted R-squared: 0.2101
## F-statistic: 161.2 on 80 and 48103 DF, p-value: < 2.2e-16
# Assumption Checks
par(mfrow = c(2,2))
plot(rain_model)
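The Welch t-test compared mean log10-transformed enterococci levels on rainy and non-rainy days. The difference was statistically significant (t = -61.9, df = 121,492, p < 2.2e-16): the mean was 0.97 on rainy days versus 0.70 on non-rainy days, and the 95% confidence interval for the difference (non-rainy minus rainy) was -0.274 to -0.257. Bacteria levels were therefore higher, on average, on rainy days.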
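Because the test was run on log10(CFU + 1), the difference in means is easier to read after back-transforming it to a ratio. The sketch below does that arithmetic with the group means and confidence limits copied from the t-test output above; the variable names are only illustrative. Note that the ratio applies to the geometric mean of (CFU + 1), not to the raw mean.

```r
# Group means of log10(CFU + 1), copied from the t-test output.
mean_log_nonrainy <- 0.7014733
mean_log_rainy    <- 0.9672956

# A difference of log10 values corresponds to a ratio on the original scale.
diff_log <- mean_log_rainy - mean_log_nonrainy
ratio <- 10^diff_log

# Same conversion for the 95% confidence limits (sign flipped so the
# ratio is rainy / non-rainy).
ci_ratio <- 10^c(0.2574052, 0.2742393)

round(c(ratio = ratio, lower = ci_ratio[1], upper = ci_ratio[2]), 2)
```

Read this way, the typical (geometric mean) level on rainy days is roughly 1.8 times the level on non-rainy days.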
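The simple regression tested whether rainfall amount predicts log10-transformed bacteria levels. The slope was positive and statistically significant (0.0244 per mm of rain, p < 2e-16), so bacteria levels tend to increase as rainfall increases. However, the R-squared value was 0.032, meaning rainfall alone explains only about 3% of the variation in log-transformed bacteria levels.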
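Since the response is on a log10 scale, the slope is a multiplicative effect on the original scale. As a rough sketch, the hypothetical helper `fold_change()` below converts the fitted `precipitation_mm` coefficient into the predicted fold change in (CFU + 1) for a given amount of rain. It assumes the fitted straight line holds across that range, which the diagnostic plots would need to support.

```r
# precipitation_mm coefficient copied from summary(rain_model).
slope_per_mm <- 0.0244279

# Multiplicative change in (CFU + 1) predicted for a given amount of rain.
# fold_change() is an illustrative helper, not part of the original analysis.
fold_change <- function(rain_mm, slope = slope_per_mm) {
  10^(slope * rain_mm)
}

fold_change(c(1, 10, 25))
```

Under this model, 10 mm of rain corresponds to roughly a 1.75-fold increase in (CFU + 1), even though rainfall explains little of the overall variation.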
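In the multiple regression, rainfall remained a significant predictor after controlling for water temperature and swim site (slope = 0.0203 per mm, p < 2e-16), which suggests that the association between rainfall and bacterial contamination is not explained by differences among sites or by water temperature alone. The model explained about 21% of the variation in log-transformed bacteria levels (adjusted R-squared = 0.21), and the large site coefficients suggest that much of this comes from differences among swim sites. Because 75,039 observations lacked a water temperature reading, this model was fit to 48,184 observations, so its results are not directly comparable with the simple regression.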
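The multiple regression output reports that 75,039 observations were deleted due to missingness, so the two models in this report were fit to different rows. One way to compare them fairly is to refit both on the same complete cases. The sketch below shows the idea on simulated data; the data frame `sim`, the site labels, and all values are invented, and only the column names follow the report.

```r
set.seed(42)
n <- 2000
# Simulated stand-in for combined_data; values are invented.
sim <- data.frame(
  swim_site = factor(sample(c("Site A", "Site B", "Site C"), n, replace = TRUE)),
  precipitation_mm = rexp(n, rate = 0.5),
  water_temperature_c = rnorm(n, mean = 20, sd = 3)
)
site_shift <- c("Site A" = 0, "Site B" = 0.5, "Site C" = 1)
sim$log_enterococci <- 0.3 + 0.02 * sim$precipitation_mm +
  unname(site_shift[as.character(sim$swim_site)]) + rnorm(n, sd = 0.6)
# Blank out most temperatures to mimic the missingness in the report.
sim$water_temperature_c[sample(n, 1200)] <- NA

# Keep only rows usable by the larger model, then fit both models on them.
complete_rows <- sim[complete.cases(sim), ]
simple_fit <- lm(log_enterococci ~ precipitation_mm, data = complete_rows)
full_fit <- lm(
  log_enterococci ~ precipitation_mm + water_temperature_c + swim_site,
  data = complete_rows
)

# The nested-model F-test is valid because both fits use identical rows.
anova(simple_fit, full_fit)
c(simple = summary(simple_fit)$adj.r.squared,
  full = summary(full_fit)$adj.r.squared)
```

Applied to the real data, the same pattern would show how much of the jump in R-squared comes from the added predictors rather than from the change in sample.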
This analysis is important because rainfall runoff may carry fecal matter, pollutants, and waste into recreational waters, increasing health risks for swimmers. Understanding this relationship can help improve beach advisories and public safety decisions.
This analysis has several limitations. Other environmental factors such as wind, sunlight, currents, and wildlife were not included. Rainfall effects may vary across beaches. Water temperature was missing for most samples, so the multiple regression used fewer than half of the observations. These are observational data, so direct causation cannot be confirmed. Lastly, bacteria levels naturally fluctuate over time.
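The possibility that rainfall effects vary across beaches could be examined with a rainfall-by-site interaction term. This was not part of the analysis above; the following is only a sketch on simulated data, with hypothetical site labels and made-up slopes, to show what such a model would look like.

```r
set.seed(7)
n_per_site <- 300
sites <- c("Open beach", "Harbour pool")  # hypothetical site labels
true_slopes <- c("Open beach" = 0.01, "Harbour pool" = 0.05)  # made-up values

sim <- data.frame(
  swim_site = factor(rep(sites, each = n_per_site), levels = sites),
  precipitation_mm = rexp(2 * n_per_site, rate = 0.2)
)
sim$log_enterococci <- 0.5 +
  unname(true_slopes[as.character(sim$swim_site)]) * sim$precipitation_mm +
  rnorm(nrow(sim), sd = 0.5)

# The interaction term lets the rainfall slope differ by site.
additive_fit <- lm(log_enterococci ~ precipitation_mm + swim_site, data = sim)
interaction_fit <- lm(log_enterococci ~ precipitation_mm * swim_site, data = sim)

# F-test: does allowing site-specific slopes improve the fit?
anova(additive_fit, interaction_fit)

# Estimated rainfall slope for each site.
coefs <- coef(interaction_fit)
c(
  open_beach = unname(coefs["precipitation_mm"]),
  harbour_pool = unname(coefs["precipitation_mm"] +
    coefs["precipitation_mm:swim_siteHarbour pool"])
)
```

With the many swim sites in the real data, an interaction model adds one extra slope per site, so it would need enough samples per site to be estimated reliably.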
Overall, this project examined whether rainfall is associated with bacteria levels in recreational water. Both the t-test and the regression models found a statistically significant positive association between rainfall and enterococci concentrations, although rainfall alone explained only a small share of the variation. Monitoring rainfall may help predict unsafe swimming conditions and protect public health.
Water Quality Dataset. TidyTuesday (2025-05-20).
Weather Dataset. TidyTuesday (2025-05-20).
U.S. Environmental Protection Agency (EPA). Recreational Water Quality Criteria.
Wikimedia Commons or personal project image: BeachwatchMap.png
We acknowledge the use of ChatGPT (chatgpt.com) to assist with revising the original code, identifying errors, and improving the graphs and statistical tests used in this project. It was used to help streamline sections of code and identify potential weaknesses in the analysis. The outputs were used to correct syntax errors and refine the workflow before running the final analysis on the complete dataset.