1 Executive Summary

This report is based on the analysis of data provided by the NSW Department of Planning and Environment. The first objective of the analysis is to examine how water quality (proxied by Enterococci concentration) fluctuates by month. The report also explores differences in water quality by site. Lastly, the report evaluates whether there is a significant difference in water quality between the two sites with the lowest water quality (Birdwood Park and Bilarong Reserve).

The main findings of the study are as follows.

  • Water quality varies greatly over months. This variability could be due to differences in rainfall in different months.
  • The quality of water varies by site, with Birdwood Park and Bilarong Reserve in Narrabeen Lagoon having the highest median Enterococci concentrations and hence the lowest water quality.
  • The difference in water quality between Birdwood Park and Bilarong Reserve in Narrabeen Lagoon is not statistically significant.

We recommend further research to determine factors associated with water quality in this lagoon.


2 Full Report

2.1 Initial Data Analysis (IDA)

2.1.1 IDA: Source

  • Background: This report is based on the analysis of data provided by the NSW Department of Planning and Environment. The broad aim is to explore some of the factors that determine the quality of water in Narrabeen Lagoon. We proxy water quality using the Enterococci concentration per 100 mL, with higher Enterococci concentrations corresponding to lower water quality.

  • The report has three objectives:

  1. To examine monthly fluctuations in water quality in the Narrabeen Lagoon.
  2. To explore how water quality varies by site of the Narrabeen Lagoon where a sample of water is drawn.
  3. To evaluate whether there is a significant difference in water quality between the two sites of Narrabeen Lagoon with the lowest water quality.
  • The data are considered valid because they come from an official dataset. The Beachwatch water quality program has run since 1989 and has since been expanded to monitor more sites. The program dataset provides regular and reliable water quality information.

  • A possible issue is that this dataset only covers water quality in 2018. The 2021 water quality report indicates that water quality has since declined slightly, so the findings here may be out of date.

  • Potential stakeholders include swimmers and tourists, whose health may be affected by polluted recreational water, as well as government and environmental institutions. These institutions can use the water quality information to improve their assessment of wastewater treatment plants. If routine monitoring shows high bacterial levels, they can issue warnings and act promptly to find the source of pollution.

2.1.2 IDA: Variables

  • Each row represents one day's record of all variables for a single beach site.
  • Each column represents one variable recorded for the beaches.
  • The key variables are:
    Table 1: Variables Description
    Variable                   Description
    Beach_id                   Identification number for every beach.
    Region                     Region where the beaches are located.
    Council                    Local governing council.
    Site                       The monitored site.
    Longitude                  Longitude of the sample collection site.
    Latitude                   Latitude of the sample collection site.
    Date                       Date of sample collection.
    Enterococci (cfu/100ml)    Enterococci concentration in colony-forming units per 100 mL of sample (cfu/100 mL).
    Month                      Month of the year, January through December.
    Day of Week                Day of the week, such as Monday or Tuesday.
## read in data
library(readr)
data = read.csv('beaches.csv')
## missing value
sum(is.na(data))
## [1] 3
data1 = na.omit(data)
## double check cleaning (count missing values in the cleaned data)
sum(is.na(data1))
## [1] 0
## class of the data object
class(data)
## [1] "data.frame"
## dimensions of data
dim(data1)
## [1] 1273   11
## classifications of variables
str(data)
## 'data.frame':    1276 obs. of  11 variables:
##  $ X                      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ BeachId                : num  4 4 4 4 4 4 4 4 4 4 ...
##  $ Region                 : chr  "Sydney Northern Ocean Beaches" "Sydney Northern Ocean Beaches" "Sydney Northern Ocean Beaches" "Sydney Northern Ocean Beaches" ...
##  $ Council                : chr  "Northern Beaches Council" "Northern Beaches Council" "Northern Beaches Council" "Northern Beaches Council" ...
##  $ Site                   : chr  "Avalon Beach" "Avalon Beach" "Avalon Beach" "Avalon Beach" ...
##  $ Longitude              : num  151 151 151 151 151 ...
##  $ Latitude               : num  -33.6 -33.6 -33.6 -33.6 -33.6 ...
##  $ Date                   : chr  "2018/1/25" "2018/2/7" "2018/2/19" "2018/1/19" ...
##  $ Enterococci..cfu.100ml.: int  6 0 1 1 2 0 33 5 0 0 ...
##  $ Month                  : chr  "January" "February" "February" "January" ...
##  $ Day.of.Week            : chr  "Thursday" "Wednesday" "Monday" "Friday" ...
## class of each column
sapply(data, class)
##                       X                 BeachId                  Region 
##               "integer"               "numeric"             "character" 
##                 Council                    Site               Longitude 
##             "character"             "character"               "numeric" 
##                Latitude                    Date Enterococci..cfu.100ml. 
##               "numeric"             "character"               "integer" 
##                   Month             Day.of.Week 
##             "character"             "character"
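
The missing-value check above reports only a total. As a minimal sketch, the count can also be broken down by column to see where the gaps are. The data frame `toy` and its values are hypothetical stand-ins, not rows from beaches.csv.

```r
# Hypothetical toy data standing in for beaches.csv (values are made up)
toy <- data.frame(
  Site = c("A", "A", "B", "B"),
  Enterococci = c(6, NA, 1, 33),
  Month = c("January", "February", "February", NA)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Drop every row that has at least one missing value
clean <- na.omit(toy)

# Number of rows removed by the cleaning step
rows_removed <- nrow(toy) - nrow(clean)
print(rows_removed)
```

Checking `sum(is.na(clean))` afterwards confirms that the cleaned copy, rather than the original, is free of missing values.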


2.2 Research Question 1: How does water quality vary by month?

This section examines the relationship between month and Enterococci concentration (cfu per 100 mL).

#par(mfrow = c(1,2))
month = data$Month
cfu = data$Enterococci..cfu.100ml.
month = factor(month, levels = c('January','February','March','April','May','June','July','August','September','October','November','December'))
## outline = FALSE hides the outliers so the boxes stay readable
boxplot(cfu ~ month, las = 2, outline = FALSE, ylab = 'cfu per 100mL', xlab = 'Month')

setNames(aggregate(data$Enterococci..cfu.100ml., list(data$Month), FUN = median, na.rm = TRUE), c('Month','Enterococci..cfu.100ml')) |> formatting_function(caption = "Table : Median Enterococci Concentration by Month")
Table : Median Enterococci Concentration by Month
Month        Enterococci..cfu.100ml
April        2.5
August       0.0
December     1.0
February     2.5
January      4.0
July         0.0
June         8.0
March        4.0
May          1.0
November     1.0
October      0.0
September    1.0
data_outlier = data[data$Enterococci..cfu.100ml.<=100,]
setNames(aggregate(data_outlier$Enterococci..cfu.100ml., list(data_outlier$Month), FUN = sd, na.rm = TRUE), c('Month','Enterococci..cfu.100ml')) |> formatting_function(caption = "Table : Dispersion in Water Quality")
Table : Dispersion in Water Quality
Month        Enterococci..cfu.100ml
April        9.879512
August       1.603121
December     12.677656
February     12.974284
January      21.381945
July         11.179846
June         20.835502
March        17.802633
May          13.636694
November     8.781690
October      9.931151
September    15.738564
  • Analysis: Figure 1 and the first table above show that the median Enterococci concentration is highest in June (8.0), followed by January and March (both 4.0). These three months have the worst water quality. The second table shows the dispersion in water quality. In this table, we compute the standard deviation of Enterococci concentration per 100 mL after removing outliers (samples with a concentration greater than 100 cfu per 100 mL).

  • Additional research and summary: The results show that the variability in water quality is highest in January, June and March, in that order. The reason for this instability is unclear and warrants further research. However, based on previous research such as @nsw_government_2022, @nsw_rainfall_2022, and @lagoon, we hypothesise that the low water quality and large dispersion observed in January and March are due to the high rainfall in those months. June does not fit this explanation, so the cause of its poor water quality remains unclear and could be researched further.
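
The monthly tables above are sorted alphabetically, which makes seasonal patterns harder to read. As a rough sketch, converting the month to a factor ordered by R's built-in `month.name` constant puts the summaries in calendar order. The data frame `toy` and its values are hypothetical, and the 100 cfu cut-off mirrors the outlier rule described in the analysis.

```r
# Hypothetical toy samples (cfu per 100 mL); not the report's data
toy <- data.frame(
  Month = c("March", "January", "January", "June", "March", "June", "March"),
  cfu   = c(4, 2, 6, 8, 150, 10, 6)
)

# Order months by the calendar instead of alphabetically
toy$Month <- factor(toy$Month, levels = month.name)

# Median per month, using all samples
med <- aggregate(cfu ~ Month, data = toy, FUN = median)

# Standard deviation per month, after dropping samples above 100 cfu
kept <- toy[toy$cfu <= 100, ]
disp <- aggregate(cfu ~ Month, data = kept, FUN = sd)

print(med)
print(disp)
```

With this ordering the rows appear as January, March, June, so neighbouring months sit next to each other in the table.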


2.3 Research Question 2: How does water quality vary by site?

This section examines the relationship between sampling site and Enterococci concentration (cfu per 100 mL).

## The beach site names are too long for the axis labels, so BeachId is used instead.
sites = data$BeachId
#print(n)
## outline = FALSE hides the outliers so the boxes stay readable
boxplot(cfu ~ sites, las = 2, outline = FALSE, ylab = 'cfu per 100mL')

To identify the sites with the highest Enterococci concentration (cfu per 100 mL), we tabulate the median concentration by site below.

setNames(aggregate(data$Enterococci..cfu.100ml.,list(data$Site), FUN = median, na.rm = TRUE), c('Site','Enterococci..cfu.100ml')) |> formatting_function(caption = "Table : Median Enterococci Concentration by Site")
Table : Median Enterococci Concentration by Site
Site                                   Enterococci..cfu.100ml
Avalon Beach                           0.0
Bilarong Reserve (Narrabeen Lagoon)    12.5
Bilgola Beach                          0.0
Birdwood Park (Narrabeen Lagoon)       17.5
Bungan Beach                           0.0
Collaroy Beach                         2.0
Dee Why Beach                          2.5
Freshwater Beach                       3.0
Long Reef Beach                        1.0
Mona Vale Beach                        0.0
Newport Beach                          0.0
North Curl Curl Beach                  2.0
North Narrabeen Beach                  0.0
North Steyne Beach                     2.5
Palm Beach                             0.0
Queenscliff Beach                      2.5
Shelly Beach (Manly)                   5.0
South Curl Curl Beach                  1.0
South Steyne Beach                     10.0
Turimetta Beach                        0.0
Warriewood Beach                       0.0
Whale Beach                            0.0
data_outlier = data[data$Enterococci..cfu.100ml.<=100,]
setNames(aggregate(data_outlier$Enterococci..cfu.100ml., list(data_outlier$Site), FUN = sd, na.rm = TRUE), c('Site','Enterococci..cfu.100ml')) |> formatting_function(caption = "Table : Dispersion in Water Quality")
Table : Dispersion in Water Quality
Site                                   Enterococci..cfu.100ml
Avalon Beach                           4.886275
Bilarong Reserve (Narrabeen Lagoon)    24.062795
Bilgola Beach                          10.755850
Birdwood Park (Narrabeen Lagoon)       20.072969
Bungan Beach                           13.195274
Collaroy Beach                         13.768847
Dee Why Beach                          10.796869
Freshwater Beach                       15.295474
Long Reef Beach                        13.258973
Mona Vale Beach                        14.340725
Newport Beach                          9.834852
North Curl Curl Beach                  12.962521
North Narrabeen Beach                  15.555221
North Steyne Beach                     10.136679
Palm Beach                             9.042415
Queenscliff Beach                      17.181189
Shelly Beach (Manly)                   17.073371
South Curl Curl Beach                  12.564173
South Steyne Beach                     15.734650
Turimetta Beach                        5.627927
Warriewood Beach                       10.285048
Whale Beach                            3.712343
  • Analysis: Most beaches have a very low median Enterococci concentration. However, Birdwood Park and Bilarong Reserve have much higher median concentrations and hence the lowest water quality. These two sites in Narrabeen Lagoon also exhibit the highest dispersion and the highest maximum Enterococci concentrations of all sites. Again, the concentration of Enterococci is not normally distributed.

Is the concentration of Enterococci at Birdwood Park significantly different from that at Bilarong Reserve? The hypotheses for this test are as follows:

H0: There is no difference between the mean Enterococci concentrations at Bilarong Reserve (Narrabeen Lagoon) and Birdwood Park (Narrabeen Lagoon).

H1: There is a difference between the mean Enterococci concentrations at Bilarong Reserve (Narrabeen Lagoon) and Birdwood Park (Narrabeen Lagoon).

We further assume that:

  1. The dependent variable is continuous.
  2. All observations are independent.
  3. The two samples have the same spread.
  4. The two sample means follow a normal distribution.
  5. There are no outliers.
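
Assumptions 3 to 5 can be examined before running the test. The following is a minimal sketch, not part of the original analysis. The vectors `site_a` and `site_b` hold hypothetical concentrations, and the choice of the F test, the Shapiro-Wilk test and the boxplot whisker rule is an assumption about how one might check them. The Shapiro-Wilk test examines the raw observations, which is a stricter condition than normality of the sample means.

```r
# Hypothetical concentrations (cfu per 100 mL) for two sites; not the report's data
site_a <- c(5, 12, 18, 25, 40, 60, 95, 300)
site_b <- c(2, 8, 13, 15, 22, 35, 50, 120)

# Assumption 3: F test of equal variances between the two samples
spread_check <- var.test(site_a, site_b)

# Assumption 4: Shapiro-Wilk normality test on each sample
normal_a <- shapiro.test(site_a)
normal_b <- shapiro.test(site_b)

# Assumption 5: values beyond the boxplot whiskers are flagged as outliers
outliers_a <- boxplot.stats(site_a)$out
outliers_b <- boxplot.stats(site_b)$out

print(spread_check$p.value)
print(c(normal_a$p.value, normal_b$p.value))
print(outliers_a)
print(outliers_b)
```

Small p-values from the first two checks, or any flagged outliers, would suggest that the equal-variance t-test result should be read with caution.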

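We use a significance level of 0.05.

  • Conclusion: The p-value of 0.207 is greater than the significance level of 0.05. Hence, we fail to reject the null hypothesis: the data do not show a statistically significant difference in mean Enterococci concentration between the two sites.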

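## select the rows for the two sites being compared (BeachId 10.1 and 10.2)
diff_1 = data[data$BeachId == 10.1, ]
diff_2 = data[data$BeachId == 10.2, ]
a1 = diff_1$Enterococci..cfu.100ml.
a2 = diff_2$Enterococci..cfu.100ml.
## sample sizes and sample means
n1 = length(a1)
n2 = length(a2)
m1 = mean(a1)
m2 = mean(a2)
## pooled variance (relies on the equal-spread assumption)
sdp2 = ((n1-1)*sd(a1)^2+(n2-1)*sd(a2)^2)/(n1 + n2 -2)
## standard error of the difference in means
se = sqrt(sdp2*(1/n1+1/n2))
## test statistic under H0 (difference in means = 0)
Teststat = (m1-m2-0)/se

## two-sided p-value from the t distribution with n1 + n2 - 2 degrees of freedom
Pvalue = 2*pt(abs(Teststat), n1+n2-2, lower.tail = FALSE)
print(c(Pvalue, Teststat))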
## [1] 0.2070082 1.2690498
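## check against the built-in equal-variance t-test
## (named `check` to avoid masking the base function c())
check = t.test(a1, a2, var.equal = TRUE)
print(check)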
## 
##  Two Sample t-test
## 
## data:  a1 and a2
## t = 1.269, df = 114, p-value = 0.207
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -67.53331 308.29194
## sample estimates:
## mean of x mean of y 
## 171.62069  51.24138
  • Summary: Water quality at most beaches is very good and stable, but at the two Narrabeen Lagoon sites it is poor and unstable. The hypothesis test finds no statistically significant difference between these two sites. Both sites warrant continued monitoring. Our background research identified no explicit pollution source, which suggests that other factors influence the data.
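
For reference, the hand-computed statistic in the code above corresponds to the standard pooled two-sample t-test. In the notation below, which is introduced here for illustration, $n_1, n_2$ are the sample sizes, $\bar{x}_1, \bar{x}_2$ the sample means (`m1`, `m2`), $s_1, s_2$ the sample standard deviations, and $s_p^2$ the pooled variance (`sdp2`):

```latex
s_p^2 = \frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2},
\qquad
\mathrm{SE} = \sqrt{s_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)},
\qquad
t = \frac{\bar{x}_1 - \bar{x}_2}{\mathrm{SE}},
\qquad
p = 2\,P\!\left(T_{n_1 + n_2 - 2} \ge |t|\right)
```

Here $T_{n_1 + n_2 - 2}$ denotes a t-distributed variable with $n_1 + n_2 - 2$ degrees of freedom. In the output above, $t = 1.269$ on 114 degrees of freedom gives $p = 0.207$.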


3 References

1. Australian Government - Bureau of Meteorology. (2022). Annual Australian climate statement 2021. Australian Government - Bureau of Meteorology.
2. NSW Government. (2022). State of the beaches 2021–22. NSW Government.
3. NSW Government. (2021). Narrabeen Lagoon: NSW Environment and Heritage. NSW Government.