Image 1: City of Sydney beaches are located in highly urbanised and densely populated areas, often along Sydney Harbour rather than the open ocean. These sites are surrounded by residential, commercial, and tourism infrastructure, including roads, stormwater drains, and waterfront developments.

Image 2: Northern Beaches Council beaches are located along Sydney’s open ocean coastline and are generally less urbanised than those in the City of Sydney.

Part 1 - Let's Dive In!

Sydney beaches were in the news during the summer of 2025 due to heavy rainfall raising concerns about water safety. Intense rainfall can wash stormwater, sewage overflows, and other pollutants into the ocean, temporarily lowering water quality and making swimming less safe. It can also increase runoff from streets, drains, and creeks, which may raise bacterial contamination and overall pollution levels.

The dataset used in this study was obtained from the New South Wales State Government’s Beachwatch program. Beachwatch and its partners monitor water quality at swimming sites to ensure that recreational water environments are managed as safely as possible. The dataset includes both water quality measurements and historical weather data from 1991 to 2025.

This study aims to investigate the relationship between rainfall and bacterial contamination at Sydney beaches. Specifically, it examines whether water quality differs by precipitation level and by location (council). Water quality is measured using enterococci concentrations in CFU per 100 mL. Enterococci are fecal indicator bacteria distinct from E. coli, but this report refers to them as E. coli throughout. It is hypothesized that higher rainfall will lead to increased bacterial levels due to stormwater runoff and sewage overflow. Using R, trends and relationships between rainfall and water quality are analyzed across selected Sydney regions.

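library(tidyverse)    # read_csv(), dplyr verbs, and ggplot2
library(wesanderson)  # wes_palette(), used for the histogram fills
water_quality <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2025/2025-05-20/water_quality.csv')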
## Rows: 123530 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (3): region, council, swim_site
## dbl  (5): enterococci_cfu_100ml, water_temperature_c, conductivity_ms_cm, la...
## date (1): date
## time (1): time
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
weather <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2025/2025-05-20/weather.csv')
## Rows: 12538 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl  (5): max_temp_C, min_temp_C, precipitation_mm, latitude, longitude
## date (1): date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Merge the two datasets on the sampling date
data_joined <- water_quality %>%
  left_join(weather, by = "date")
## Use the dplyr package to subset and filter the dataset
## 123,530 rows across multiple regions and multiple sites within each region is too much data to handle
## Instead, focus on the Sydney Harbour region and compare two council areas: The City of Sydney and Northern Beaches Council
sydney.df <- data_joined %>%
  filter(
    region  == "Sydney Harbour", #include ONLY beaches in the region of Sydney Harbour
    council %in% c("The City of Sydney", "Northern Beaches Council"), #only focus on two different council areas
    enterococci_cfu_100ml > 0, # drop zero counts so log() is defined (rows with NA are dropped too)
    format(date, "%Y") %in% c("2018", "2019", "2020", "2021", "2022")  # only look at 2018 - 2022
  ) %>%
  mutate(
    log_ecoli  = log(enterococci_cfu_100ml), # log-transform E. coli concentrations
    log_precip = log1p(precipitation_mm),  # log1p handles zero-rain days safely
    month      = format(date, "%m"),        # extract the month
    year       = format(date, "%Y"),        # 4-digit year is clearer than 2-digit
    rain_bin   = cut(
      precipitation_mm,
      breaks = c(-Inf, 1, 5, 10, 20, Inf),   # -Inf catches exact zeros if any remain
      labels = c("≤1", "1–5", "5–10", "10–20", "20+"),
      right  = TRUE
    )
  )
#Make sure you look at your data structure
head(sydney.df) 
## # A tibble: 6 × 20
##   region         council        swim_site date       time  enterococci_cfu_100ml
##   <chr>          <chr>          <chr>     <date>     <tim>                 <dbl>
## 1 Sydney Harbour The City of S… Darling … 2022-12-29 07:55                    45
## 2 Sydney Harbour Northern Beac… Manly Co… 2022-12-29 09:00                     6
## 3 Sydney Harbour Northern Beac… Davidson… 2022-12-29 09:50                     1
## 4 Sydney Harbour Northern Beac… Gurney C… 2022-12-29 09:31                     8
## 5 Sydney Harbour Northern Beac… Forty Ba… 2022-12-29 09:07                     2
## 6 Sydney Harbour Northern Beac… Little M… 2022-12-21 10:42                     1
## # ℹ 14 more variables: water_temperature_c <dbl>, conductivity_ms_cm <dbl>,
## #   latitude.x <dbl>, longitude.x <dbl>, max_temp_C <dbl>, min_temp_C <dbl>,
## #   precipitation_mm <dbl>, latitude.y <dbl>, longitude.y <dbl>,
## #   log_ecoli <dbl>, log_precip <dbl>, month <chr>, year <chr>, rain_bin <fct>
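
The rain_bin column depends on how cut() treats values that fall exactly on a break. The following is a minimal sketch using the same breaks and labels as above on made-up rainfall totals (rain_mm and bins are hypothetical names, not part of the dataset); it shows that with right = TRUE each interval includes its upper edge.

```r
# Hypothetical rainfall totals (mm), chosen to sit on and around the bin edges
rain_mm <- c(0, 1, 1.5, 5, 10, 20, 20.1)

bins <- cut(
  rain_mm,
  breaks = c(-Inf, 1, 5, 10, 20, Inf),
  labels = c("≤1", "1–5", "5–10", "10–20", "20+"),
  right  = TRUE   # intervals are closed on the right: (1, 5], (5, 10], ...
)

# Show each value next to the category it was assigned
data.frame(rain_mm, bins)
```

A day with exactly 5 mm therefore lands in the "1–5" category rather than "5–10", and dry days fall into "≤1".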

These two datasets, water quality and historical weather data, will help us answer an important public health question: “Does water quality differ by precipitation level and by location?” To investigate this question, we first combined and filtered the data.
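
When combining tables by date, it can be worth confirming that the join did not duplicate or lose samples. The sketch below uses two small made-up tables (wq and wx are hypothetical stand-ins for water_quality and weather) and assumes, as in the weather file, one weather record per date.

```r
library(dplyr)

# Toy stand-ins for the two tables (values are made up for illustration)
wq <- tibble(
  date      = as.Date(c("2022-01-01", "2022-01-01", "2022-01-02")),
  swim_site = c("Site A", "Site B", "Site A"),
  enterococci_cfu_100ml = c(12, 40, 3)
)
wx <- tibble(
  date             = as.Date(c("2022-01-01", "2022-01-02")),
  precipitation_mm = c(8.2, 0)
)

# A left join keeps every water-quality row. It only adds rows when a date
# appears more than once in the weather table, so check that first.
stopifnot(!anyDuplicated(wx$date))

joined <- left_join(wq, wx, by = "date")

nrow(joined) == nrow(wq)             # TRUE when no rows were duplicated
sum(is.na(joined$precipitation_mm))  # samples with no matching weather record
```

The same two checks could be run on data_joined before filtering.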

Part 2 - Exploring Our Data

The first histogram shows the raw E. coli levels for the two councils. Both the Northern Beaches Council and the City of Sydney show a right-skewed distribution: most observations have low E. coli levels, with a small number of very high values creating a long tail to the right. This skewness suggests that a log transformation is appropriate to bring the data closer to a normal distribution.

#1) Using ggplot2 and facet_wrap(), create a histogram of the raw E.coli data facet_wrapped by council
ggplot(sydney.df, aes(x = enterococci_cfu_100ml, fill = council)) +
  geom_histogram(bins = 30, color = "black") +
  facet_wrap(~ council) +
  scale_fill_manual(values = wes_palette("GrandBudapest1")) +
  labs(
    title = "Raw E. coli Distribution by Council",
    x = "E. coli (CFU/100 mL)",
    y = "Count"
  )

After applying a log transformation to the raw E. coli data, the histogram becomes more evenly distributed, although it is still not perfectly normal. The Northern Beaches Council data still shows a right skew. In contrast, the City of Sydney distribution appears closer to a normal distribution, suggesting that the transformation was more effective for that region.

#2) Using ggplot2 and facet_wrap(), create a histogram of the log-transformed E.coli data facet_wrapped by council
ggplot(sydney.df, aes(x = log_ecoli, fill = council)) +
  geom_histogram(bins = 30, color = "black") +
  facet_wrap(~ council) +
  scale_fill_manual(values = wes_palette("GrandBudapest1")) +
  labs(
    title = "Log E. coli Distribution by Council",
    x = "Log E. coli (log CFU/100 mL)",
    y = "Count"
  ) 
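
The effect of the log transformation on skewness can also be checked numerically rather than by eye. This is a minimal sketch on simulated log-normal values, not the Beachwatch data. The skewness() helper is a hypothetical function defined here (base R does not ship one), using the simple moment-based formula.

```r
set.seed(1)  # make the simulated draws reproducible

# Simulated right-skewed counts (log-normal), standing in for the raw data
raw <- rlnorm(1000, meanlog = 2, sdlog = 1.2)

# Moment-based sample skewness: 0 for a symmetric distribution,
# positive when the long tail is on the right
skewness <- function(x) {
  z <- x - mean(x)
  mean(z^3) / mean(z^2)^1.5
}

skew_raw <- skewness(raw)       # clearly positive
skew_log <- skewness(log(raw))  # close to 0 after the transform

round(c(raw = skew_raw, log = skew_log), 2)
```

Applied to enterococci_cfu_100ml and log_ecoli within each council, the same helper would put a number on the remaining skew described above.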

A histogram was also created to visualize precipitation levels for the two councils. The distribution of precipitation is right-skewed in both, indicating that most days have low rainfall, with a few days experiencing much higher amounts. After applying a log transformation (log1p(), which computes log(1 + x) so that zero-rain days remain defined), the data becomes more evenly distributed; however, a slight right skew still remains in both histograms.

library(RColorBrewer)

ggplot(sydney.df, aes(x = precipitation_mm, fill = council)) +
  geom_histogram(bins = 30, color = "black") +
  facet_wrap(~ council) +
  scale_fill_brewer(palette = "Set3") +
  labs(
    title = "Raw Precipitation Distribution by Council",
    x = "Precipitation (mm)",
    y = "Count"
  )

Log transformation of precipitation levels:

ggplot(sydney.df, aes(x = log_precip, fill = council)) +
  geom_histogram(bins = 30, color = "black") +
  facet_wrap(~ council) +
  scale_fill_brewer(palette = "Set3") +
  labs(
    title = "Log Precipitation Distribution by Council",
    x = "Log precipitation, log(1 + mm)",
    y = "Count"
  )
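
The precipitation column is transformed with log1p() rather than log() because many days have no rain. A short sketch on made-up rainfall values (precip_mm and back are hypothetical names) shows the difference.

```r
# Hypothetical daily rainfall totals (mm), including a dry day
precip_mm <- c(0, 0.2, 1, 10, 50)

log(precip_mm)    # first element is -Inf, which cannot be plotted or modelled
log1p(precip_mm)  # log(1 + x): the dry day maps to 0 instead

# expm1() inverts log1p(), so values can be returned to millimetres
back <- expm1(log1p(precip_mm))
all.equal(back, precip_mm)
```

One consequence is that the transformed axis is in units of log(1 + mm), not log(mm), so small rainfall amounts are compressed less than they would be under a plain log.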

The boxplot shows the distribution of log-transformed E. coli levels across different rainfall categories for both councils. There is a slight upward trend in bacterial levels as rainfall increases. For both the Northern Beaches Council and the City of Sydney, the median log E. coli values rise from the lowest rainfall category (≤1 mm) to the highest (20+ mm). This suggests that higher rainfall is associated with increased bacterial contamination. Additionally, the City of Sydney consistently shows higher median E. coli levels compared to the Northern Beaches Council across all rainfall categories. Notably, outliers are present in the lower rainfall categories (≤1 mm and 1–5 mm), indicating that high bacterial levels can still occur even during periods of low rainfall. The spread of the data also increases with higher rainfall levels, as seen by the wider boxes and longer whiskers in the higher rainfall categories. This indicates greater variability in bacterial levels during heavy rainfall events.

ggplot(sydney.df, aes(x = rain_bin, y = log_ecoli, fill = council)) +
  geom_boxplot() +
  scale_fill_brewer(palette = "Paired") +
  labs(
    title = "Log E. coli by Rainfall Category and Council",
    x = "Rainfall Category (mm)",
    y = "Log E. coli"
  ) +
  theme_minimal()
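
The medians and spreads read off the boxplot can also be tabulated directly. The sketch below uses a small made-up table (toy, with hypothetical council and category names) whose column names match sydney.df; the values are invented purely to show the shape of the summary.

```r
library(dplyr)

# Made-up values standing in for sydney.df; only the column names match
toy <- tibble(
  council   = rep(c("Council A", "Council B"), each = 6),
  rain_bin  = factor(rep(rep(c("low", "high"), each = 3), times = 2),
                     levels = c("low", "high")),
  log_ecoli = c(1.0, 1.5, 2.0,  2.5, 3.0, 3.5,   # Council A: low, high
                2.0, 2.5, 3.0,  3.5, 4.0, 4.5)   # Council B: low, high
)

# One row per council and rainfall category: sample size, median, and IQR
summary_tbl <- toy %>%
  group_by(council, rain_bin) %>%
  summarise(
    n      = n(),
    median = median(log_ecoli),
    iqr    = IQR(log_ecoli),
    .groups = "drop"
  )

summary_tbl
```

Including n in the table is useful because the heavy-rain categories usually contain far fewer samples than the dry ones, which affects how much weight their medians can bear.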

Part 3 - Statistical Tests (Inferential Statistics):

A Q-Q plot was used to assess the normality of the model residuals. The points generally follow the reference line, indicating that the residuals are approximately normally distributed. Minor deviations at the tails may be present, but overall the normality assumption is reasonably satisfied.

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
model <- lm(log_ecoli ~ rain_bin * council, data = sydney.df)
qqnorm(residuals(model))
qqline(residuals(model), col = "red")
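
A Q-Q plot is a visual check, and it can be supplemented with numeric diagnostics. The following sketch is not part of the original analysis: it fits the same model form to simulated data (toy and toy_model are hypothetical) and applies a Shapiro-Wilk test together with a per-cell comparison of residual spread, both from base R.

```r
set.seed(42)  # reproducible toy data

# Simulated two-factor data; the factor names mirror the model above,
# but the values are invented for illustration
toy <- expand.grid(
  rain_bin = factor(c("low", "mid", "high"), levels = c("low", "mid", "high")),
  council  = factor(c("Council A", "Council B")),
  rep      = 1:30
)
toy$log_ecoli <- 2 + 0.5 * as.integer(toy$rain_bin) +
  0.8 * (toy$council == "Council B") + rnorm(nrow(toy))

toy_model <- lm(log_ecoli ~ rain_bin * council, data = toy)

# Shapiro-Wilk test of the residuals: a small p-value signals non-normality
sw <- shapiro.test(residuals(toy_model))
sw$p.value

# Residual spread within each cell: roughly equal values support the
# equal-variance assumption that ANOVA also relies on
cell_sd <- tapply(residuals(toy_model), list(toy$rain_bin, toy$council), sd)
cell_sd
```

With more than a thousand observations, as in the real model, formal normality tests tend to flag even small departures, so the Q-Q plot remains the more informative guide.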

A two-way ANOVA was used because the analysis includes two independent variables: rainfall category and council. This method allows us to examine not only the individual effects of each factor on E. coli levels, but also whether there is an interaction between them.

#6) Write a lm() and car::Anova() for this data
library(car)
model <- lm(log_ecoli ~ rain_bin * council, data = sydney.df)
Anova(model, type = 2)
## Anova Table (Type II tests)
## 
## Response: log_ecoli
##                  Sum Sq   Df F value Pr(>F)    
## rain_bin          394.3    4 32.3490 <2e-16 ***
## council           237.8    1 78.0432 <2e-16 ***
## rain_bin:council    4.8    4  0.3967 0.8111    
## Residuals        3367.1 1105                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
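
As a cross-check on the table above, the F statistics can be recomputed from the printed sums of squares, and the same numbers give an effect size for each term. This sketch copies the values from the output; the variable names are hypothetical, and partial eta squared is an addition for illustration, not something reported by Anova(). Small differences from the printed F values come from rounding in the table.

```r
# Sums of squares and degrees of freedom copied from the Type II table above
ss  <- c(rain_bin = 394.3, council = 237.8, "rain_bin:council" = 4.8)
dof <- c(rain_bin = 4,     council = 1,     "rain_bin:council" = 4)
ss_resid <- 3367.1
df_resid <- 1105

# F = (effect mean square) / (residual mean square)
ms_resid <- ss_resid / df_resid
f_value  <- (ss / dof) / ms_resid
p_value  <- pf(f_value, dof, df_resid, lower.tail = FALSE)

# Partial eta squared: share of variance attributed to each term after
# setting aside the other terms
partial_eta_sq <- ss / (ss + ss_resid)

round(cbind(F = f_value, partial_eta_sq), 3)
signif(p_value, 3)
```

On these figures rainfall category accounts for roughly 10% and council for roughly 7% of the variance not explained by the other terms, while the interaction accounts for well under 1%.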

Part 4 - What Do The Stats Mean?

Based on the results of this study, rainfall appears to have a strong association with water quality at Sydney beaches. Both the boxplot and the ANOVA results indicate that higher rainfall is associated with higher E. coli concentrations: bacterial levels rose across the rainfall categories, and the effect of rainfall category was highly significant (F(4, 1105) = 32.35, p < 0.001), supporting the idea that rainfall can lead to pollution events affecting the water. The council variable was also highly significant (F(1, 1105) = 78.04, p < 0.001), which means that water quality differs between the Northern Beaches Council and the City of Sydney. This suggests that geographic or environmental differences may influence contamination levels. However, the interaction between rainfall and council was not statistically significant (F(4, 1105) = 0.40, p = 0.81). This means there is no evidence that the effect of rainfall differs between the two councils, even though their overall levels of contamination do. In other words, higher rainfall is associated with higher bacterial levels regardless of the council. There were also some outliers in the lower rainfall categories, showing that high bacteria levels can still occur even when rainfall is low. This suggests that other factors, such as local pollution sources or environmental conditions, may also influence water quality. One limitation of this study is that only selected years (2018 to 2022) and a single region were analyzed. Overall, these findings highlight the importance of monitoring rainfall when assessing beach safety.

Part 5 - Conclusion

In conclusion, this study examined the relationship between rainfall and water quality at Sydney beaches using historical data. The results showed that increased rainfall is associated with higher levels of E. coli, supporting the hypothesis that rainfall negatively impacts water quality. Both rainfall and council were found to be statistically significant factors, indicating that environmental conditions and geographic location play an important role in bacterial contamination levels. However, the interaction between council and rainfall was not significant (p = 0.81), which suggests that rainfall influences both regions in a similar way. Overall, this analysis highlights the importance of monitoring rainfall when assessing beach safety. These findings can help inform public health decisions and raise awareness about the potential risks of swimming after heavy rainfall.

Part 6 - References

  1. U.S. Environmental Protection Agency. (2013, December 17). Recreational water quality criteria and methods. https://www.epa.gov/wqc/recreational-water-quality-criteria-and-methods
  2. Data Science Learning Community. (2025, May 19). Water quality at Sydney beaches. GitHub. https://github.com/rfordatascience/tidytuesday/blob/main/data/2025/2025-05-20/readme.md