This report is based on the analysis of data provided by the NSW Department of Planning and Environment. The first objective of the analysis is to examine how water quality (proxied by Enterococci concentration) fluctuates by month. The report also explores the differences in water quality by site. Lastly, the report evaluates whether there is a significant difference in water quality between the two sites with the lowest water quality (Birdwood Park and Bilarong Reserve).
The main findings of the study are as follows.
We recommend further research to determine factors associated with water quality in this lagoon.
Background: This report is based on the analysis of data provided by the NSW Department of Planning and Environment. The broad aim is to explore some of the factors that determine the quality of water in Narrabeen Lagoon. We proxy water quality using the Enterococci concentration per 100 mL, with higher Enterococci concentrations corresponding to lower water quality.
The report has three objectives:
The data is valid because it comes from an official dataset. The Beachwatch water quality program has run since 1989 and has since been expanded to cover more monitored sites. The program provides regular and reliable water quality information.
A possible issue is that this dataset covers only 2018. The 2021 water quality report suggests that water quality has declined slightly since then, so the findings here may lag behind current conditions.
Potential stakeholders include swimmers and tourists, whose health may be affected by polluted recreational water, as well as government and environmental agencies. These agencies can use the water quality information to improve their assessment of wastewater treatment plants and, when routine monitoring shows high bacterial levels, issue warnings and act promptly to find the source of the pollution.
| Variable | Description |
|---|---|
| BeachId | Identification number for every beach. |
| Region | Region where the beaches are located. |
| Council | Local governing council. |
| Site | Name of the monitored site. |
| Longitude | The sample collection site longitude. |
| Latitude | The sample collection site latitude. |
| Date | Date of sample collection. |
| Enterococci (cfu/100ml) | Enterococci concentration measured as colony-forming units per 100 mL of sample (cfu/100 mL). |
| Month | Month of the Year, January through December |
| Day of Week | Day of the Week, like Monday, Tuesday, and so on. |
## read in data
library(readr)
data = read.csv('beaches.csv')
## missing value
sum(is.na(data))
## [1] 3
data1 = na.omit(data)
## double check cleaning (on the cleaned copy, not the original)
sum(is.na(data1))
## [1] 0
## class of the data object
class(data)
## [1] "data.frame"
## dimensions of data
dim(data1)
## [1] 1273 11
## classifications of variables
str(data)
## 'data.frame': 1276 obs. of 11 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ BeachId : num 4 4 4 4 4 4 4 4 4 4 ...
## $ Region : chr "Sydney Northern Ocean Beaches" "Sydney Northern Ocean Beaches" "Sydney Northern Ocean Beaches" "Sydney Northern Ocean Beaches" ...
## $ Council : chr "Northern Beaches Council" "Northern Beaches Council" "Northern Beaches Council" "Northern Beaches Council" ...
## $ Site : chr "Avalon Beach" "Avalon Beach" "Avalon Beach" "Avalon Beach" ...
## $ Longitude : num 151 151 151 151 151 ...
## $ Latitude : num -33.6 -33.6 -33.6 -33.6 -33.6 ...
## $ Date : chr "2018/1/25" "2018/2/7" "2018/2/19" "2018/1/19" ...
## $ Enterococci..cfu.100ml.: int 6 0 1 1 2 0 33 5 0 0 ...
## $ Month : chr "January" "February" "February" "January" ...
## $ Day.of.Week : chr "Thursday" "Wednesday" "Monday" "Friday" ...
## class of each variable
sapply(data, class)
## X BeachId Region
## "integer" "numeric" "character"
## Council Site Longitude
## "character" "character" "numeric"
## Latitude Date Enterococci..cfu.100ml.
## "numeric" "character" "integer"
## Month Day.of.Week
## "character" "character"
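The single total returned by `sum(is.na(data))` does not show which columns hold the missing values. The following is a minimal sketch on a small made-up data frame (the `toy` object, its column names and its values are illustrative assumptions, not the Beachwatch data) of how per-column counts and `complete.cases()` could be used to confirm what `na.omit()` removes:

```r
# Toy data frame standing in for the beaches data (all values are made up).
toy <- data.frame(
  Site = c("A", "A", "B", "B", "C"),
  Enterococci = c(6, NA, 1, 33, NA),
  Month = c("January", "January", NA, "March", "March"),
  stringsAsFactors = FALSE
)

# Missing values per column, rather than a single grand total.
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Rows with at least one missing value; these are the rows na.omit() drops.
incomplete_rows <- which(!complete.cases(toy))
print(incomplete_rows)

# Build the cleaned copy, then check the cleaned object (not the original).
toy_clean <- na.omit(toy)
print(sum(is.na(toy_clean)))
print(dim(toy_clean))
```

Checking the cleaned object is the important step: the original data frame is unchanged by `na.omit()`, so its missing-value count stays the same.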
The relationship between month and Enterococci concentration (cfu per 100 mL)
#par(mfrow = c(1,2))
month = data$Month
cfu = data$Enterococci..cfu.100ml.
month = factor(month, levels = c('January','February','March','April','May','June','July','August','September','October','November','December'))
## hide outliers in the plot (they are not removed from the data)
boxplot(cfu ~ month, las = 2, outline = FALSE, ylab = 'cfu per 100mL', xlab = 'Month')
setNames(aggregate(data$Enterococci..cfu.100ml., list(data$Month), FUN = median, na.rm = TRUE), c('Month','Enterococci..cfu.100ml')) |> formatting_function(caption = "Table : Median Enterococci Concentration by Month")
| Month | Enterococci..cfu.100ml |
|---|---|
| April | 2.5 |
| August | 0.0 |
| December | 1.0 |
| February | 2.5 |
| January | 4.0 |
| July | 0.0 |
| June | 8.0 |
| March | 4.0 |
| May | 1.0 |
| November | 1.0 |
| October | 0.0 |
| September | 1.0 |
data_outlier = data[data$Enterococci..cfu.100ml.<=100,]
setNames(aggregate(data_outlier$Enterococci..cfu.100ml., list(data_outlier$Month), FUN = sd, na.rm = TRUE), c('Month','Enterococci..cfu.100ml')) |> formatting_function(caption = "Table : Dispersion in Water Quality")
| Month | Enterococci..cfu.100ml |
|---|---|
| April | 9.879512 |
| August | 1.603121 |
| December | 12.677656 |
| February | 12.974284 |
| January | 21.381945 |
| July | 11.179846 |
| June | 20.835502 |
| March | 17.802633 |
| May | 13.636694 |
| November | 8.781690 |
| October | 9.931151 |
| September | 15.738564 |
Analysis: Figure 1 and the first table above show that the median Enterococci concentration is highest in June (8.0 cfu per 100 mL), followed by January and March (4.0 each). These three months have the worst water quality. The second table shows the dispersion in water quality. For that table, we compute the standard deviation of Enterococci concentration per 100 mL after removing outliers (samples with a concentration greater than 100 cfu per 100 mL).
Additional research and summary: The results show that the variability in water quality is highest in January, June and March, in that order. The reason for this instability is unclear and warrants further research. However, based on previous reports such as @nsw_government_2022, @nsw_rainfall_2022, and @lagoon, we can hypothesise that the low water quality and large dispersion observed in January and March are due to the high rainfall in those months. This explanation does not hold for June, so the causes of poor water quality in June are unclear and could be researched further.
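The monthly tables above are sorted alphabetically because the month names are stored as character strings, which makes seasonal patterns harder to read. A minimal sketch, assuming a small made-up data frame `toy` with columns `Month` and `cfu` (both names and all values are illustrative), shows how the same two summaries could be produced in calendar order:

```r
# Toy samples (made-up values) with month names stored as character strings.
toy <- data.frame(
  Month = c("March", "January", "June", "January",
            "June", "March", "June", "March"),
  cfu = c(4, 2, 8, 6, 10, 150, 3, 12)
)

# Convert month names to a factor with calendar-order levels.
# month.name is a built-in R constant: "January", ..., "December".
toy$Month <- factor(toy$Month, levels = month.name)

# Median per month using all samples.
med <- aggregate(cfu ~ Month, data = toy, FUN = median)
print(med)

# Standard deviation per month after dropping samples above 100 cfu per 100 mL.
trimmed <- subset(toy, cfu <= 100)
disp <- aggregate(cfu ~ Month, data = trimmed, FUN = sd)
print(disp)
```

Because `aggregate()` orders groups by factor level, both tables list January before March before June.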
The relationship between site and Enterococci concentration (cfu per 100 mL)
## Site names are too long for axis labels, so BeachId is used instead.
sites = data$BeachId
#print(n)
boxplot(cfu ~ sites, las = 2, outline = FALSE, ylab = 'cfu per 100mL')
To identify the sites with the highest Enterococci concentration (cfu per 100 mL), we tabulate the median concentration by site.
setNames(aggregate(data$Enterococci..cfu.100ml.,list(data$Site), FUN = median, na.rm = TRUE), c('Site','Enterococci..cfu.100ml')) |> formatting_function(caption = "Table : Median Enterococci Concentration by Site")
| Site | Enterococci..cfu.100ml |
|---|---|
| Avalon Beach | 0.0 |
| Bilarong Reserve (Narrabeen Lagoon) | 12.5 |
| Bilgola Beach | 0.0 |
| Birdwood Park (Narrabeen Lagoon) | 17.5 |
| Bungan Beach | 0.0 |
| Collaroy Beach | 2.0 |
| Dee Why Beach | 2.5 |
| Freshwater Beach | 3.0 |
| Long Reef Beach | 1.0 |
| Mona Vale Beach | 0.0 |
| Newport Beach | 0.0 |
| North Curl Curl Beach | 2.0 |
| North Narrabeen Beach | 0.0 |
| North Steyne Beach | 2.5 |
| Palm Beach | 0.0 |
| Queenscliff Beach | 2.5 |
| Shelly Beach (Manly) | 5.0 |
| South Curl Curl Beach | 1.0 |
| South Steyne Beach | 10.0 |
| Turimetta Beach | 0.0 |
| Warriewood Beach | 0.0 |
| Whale Beach | 0.0 |
data_outlier = data[data$Enterococci..cfu.100ml.<=100,]
setNames(aggregate(data_outlier$Enterococci..cfu.100ml., list(data_outlier$Site), FUN = sd, na.rm = TRUE), c('Site','Enterococci..cfu.100ml')) |> formatting_function(caption = "Table : Dispersion in Water Quality by Site")
| Site | Enterococci..cfu.100ml |
|---|---|
| Avalon Beach | 4.886275 |
| Bilarong Reserve (Narrabeen Lagoon) | 24.062795 |
| Bilgola Beach | 10.755850 |
| Birdwood Park (Narrabeen Lagoon) | 20.072969 |
| Bungan Beach | 13.195274 |
| Collaroy Beach | 13.768847 |
| Dee Why Beach | 10.796869 |
| Freshwater Beach | 15.295474 |
| Long Reef Beach | 13.258973 |
| Mona Vale Beach | 14.340725 |
| Newport Beach | 9.834852 |
| North Curl Curl Beach | 12.962521 |
| North Narrabeen Beach | 15.555221 |
| North Steyne Beach | 10.136679 |
| Palm Beach | 9.042415 |
| Queenscliff Beach | 17.181189 |
| Shelly Beach (Manly) | 17.073371 |
| South Curl Curl Beach | 12.564173 |
| South Steyne Beach | 15.734650 |
| Turimetta Beach | 5.627927 |
| Warriewood Beach | 10.285048 |
| Whale Beach | 3.712343 |
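The two sites compared in the next section are the ones with the highest median concentration in the site table. Rather than reading them off by eye, the ranking could be computed directly. The sketch below uses a made-up data frame `toy` (site names and values are illustrative assumptions) to show the idea:

```r
# Toy samples (made-up values) for four hypothetical sites.
toy <- data.frame(
  Site = c("Site A", "Site A", "Site B", "Site B",
           "Site C", "Site C", "Site D", "Site D"),
  cfu = c(0, 2, 10, 25, 5, 40, 1, 3)
)

# Median concentration per site.
site_median <- aggregate(cfu ~ Site, data = toy, FUN = median)

# Sort from highest to lowest median (highest = poorest water quality).
ranked <- site_median[order(site_median$cfu, decreasing = TRUE), ]
print(ranked)

# Names of the two sites with the highest median concentration.
worst_two <- as.character(head(ranked$Site, 2))
print(worst_two)
```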
Is the concentration of Enterococci in Birdwood Park significantly different from that in Bilarong Reserve?
The hypotheses for this test are as follows:
H0: There is no difference between the mean Enterococci concentrations at Bilarong Reserve (Narrabeen Lagoon) and Birdwood Park (Narrabeen Lagoon).
H1: There is a difference between the mean Enterococci concentrations at Birdwood Park (Narrabeen Lagoon) and Bilarong Reserve (Narrabeen Lagoon).
Further, we assume that;
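Further, we use a significance level of 0.05.
Conclusion: The p-value of 0.207 is greater than the significance level of 0.05. Hence, we fail to reject the null hypothesis: the data do not show a significant difference in mean Enterococci concentration between the two sites.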
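For reference, the pooled two-sample t statistic that the R code in this section computes by hand can be written out explicitly. The symbols are notation introduced here for illustration: $n_1, n_2$ are the two sample sizes, $\bar{x}_1, \bar{x}_2$ the sample means, and $s_1, s_2$ the sample standard deviations.

$$
s_p^2 = \frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2},
\qquad
\mathrm{SE} = \sqrt{s_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}
$$

$$
t = \frac{\bar{x}_1 - \bar{x}_2}{\mathrm{SE}},
\qquad
p = 2\,P\!\left(T_{n_1 + n_2 - 2} \ge |t|\right)
$$

With the values reported in this section, $n_1 + n_2 - 2 = 114$ and $t = 1.269$, which gives $p = 0.207$.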
## select specific rows
diff_1 = data[data$BeachId == 10.1, ]
diff_2 = data[data$BeachId == 10.2, ]
a1 = diff_1$Enterococci..cfu.100ml.
a2 = diff_2$Enterococci..cfu.100ml.
n1 = length(a1)
n2 = length(a2)
m1 = mean(a1)
m2 = mean(a2)
sdp2 = ((n1-1)*sd(a1)^2+(n2-1)*sd(a2)^2)/(n1 + n2 -2)
se = sqrt(sdp2*(1/n1+1/n2))
Teststat = (m1-m2-0)/se
Pvalue = 2*pt(abs(Teststat), n1+n2-2, lower.tail = FALSE)
print(c(Pvalue, Teststat))
## [1] 0.2070082 1.2690498
## Check
## named 'check' so the base function c() is not masked
check = t.test(a1, a2, var.equal = TRUE)
print(check)
##
## Two Sample t-test
##
## data: a1 and a2
## t = 1.269, df = 114, p-value = 0.207
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -67.53331 308.29194
## sample estimates:
## mean of x mean of y
## 171.62069 51.24138
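The sample means reported above (about 171.6 and 51.2) are far larger than the site medians in the earlier table (17.5 and 12.5), which suggests strongly right-skewed data. In that situation the equal-variance and normality assumptions behind the pooled t-test may not hold well. The sketch below is not part of the original analysis. It uses simulated data (the vectors `x` and `y`, their sizes and their distribution are all assumptions) to show three alternatives that could be run as a sensitivity check: Welch's t-test, the Wilcoxon rank-sum test, and a t-test on log-transformed counts.

```r
set.seed(1)

# Simulated right-skewed counts standing in for the two sites (not real data).
x <- round(rlnorm(58, meanlog = 3.0, sdlog = 1.5))
y <- round(rlnorm(58, meanlog = 2.5, sdlog = 1.5))

# Welch's t-test: does not assume equal variances.
welch <- t.test(x, y, var.equal = FALSE)

# Wilcoxon rank-sum test: rank-based, less sensitive to extreme values.
# exact = FALSE requests the normal approximation, which tolerates ties.
rank_sum <- wilcox.test(x, y, exact = FALSE)

# Pooled t-test on log(1 + count), which reduces the skew and handles zeros.
log_t <- t.test(log1p(x), log1p(y), var.equal = TRUE)

print(c(welch = welch$p.value,
        wilcoxon = rank_sum$p.value,
        log_scale = log_t$p.value))
```

If these tests were to disagree with the pooled t-test on the real data, that would be a sign that the conclusion depends on the distributional assumptions.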
1. Australian Government - Bureau of Meteorology. (2022). Annual Australian climate statement 2021. Australian Government - Bureau of Meteorology.
2. NSW Government. (2022). State of the beaches 2021–22. NSW Government.
3. NSW Government. (2021). Narrabeen Lagoon: NSW Environment and Heritage. NSW Government.