The aim of this report is to investigate how time (month) and location (beach) relate to enterococci bacteria levels in Sydney’s Northern Beaches. The first research question investigated the relationship between time and bacterial levels. Bacterial levels were found to be higher in the summer months, and the monthly means in both summer and winter deviated more from the overall mean. The second research question investigated the impact of location on bacterial levels among the top 5 beaches. Location was found to affect bacterial levels, with Dee Why and Shelly Beach having the highest levels overall. However, the results for both questions are weak and correlational, which must be considered in any future application of these findings.
The data was sourced from the NSW Government Office of Environment and Heritage (NSW Government, 2019). Because it comes from a government source, the data is likely to be reliable. It was originally collected to provide the community with reliable information about the hygiene of swimming areas and to give researchers and decision-makers data for assessing the impacts of pollution and wastewater management. Confounding variables are likely to affect the data collection, such as weather variation, the location of sampling and unforeseen contamination events. There are also limitations in the data itself: 3 bacteria values are missing (NA). Nevertheless, the large sample size (1276 observations) supports the usefulness of the dataset.
The primary stakeholders in this study are the residents of the Northern Beaches, who are the most likely to frequent the beaches studied and be affected by the bacterial levels in the water. The data and findings are also more broadly applicable to coastal beach regions around Australia, and can inform future scientific investigation into bacterial levels in coastal regions and their sources and impacts.
data = read.csv("beaches.csv")
dim(data)
## [1] 1276 11
str(data)
## 'data.frame': 1276 obs. of 11 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ BeachId : num 4 4 4 4 4 4 4 4 4 4 ...
## $ Region : chr "Sydney Northern Ocean Beaches" "Sydney Northern Ocean Beaches" "Sydney Northern Ocean Beaches" "Sydney Northern Ocean Beaches" ...
## $ Council : chr "Northern Beaches Council" "Northern Beaches Council" "Northern Beaches Council" "Northern Beaches Council" ...
## $ Site : chr "Avalon Beach" "Avalon Beach" "Avalon Beach" "Avalon Beach" ...
## $ Longitude : num 151 151 151 151 151 ...
## $ Latitude : num -33.6 -33.6 -33.6 -33.6 -33.6 ...
## $ Date : chr "2018-01-25" "2018-02-07" "2018-02-19" "2018-01-19" ...
## $ Enterococci..cfu.100ml.: int 6 0 1 1 2 0 33 5 0 0 ...
## $ Month : chr "January" "February" "February" "January" ...
## $ Day.of.Week : chr "Thursday" "Wednesday" "Monday" "Friday" ...
The dataset consists of 11 variables with the following 3 being relevant:
Enterococci (cfu/100ml)- integer numerical variable which measures the amount of bacteria in the water samples. Values over 3000 (question 1) and over 50 (question 2) are removed for the investigation, as they are considered anomalous and confounding.
Bacteria=data$Enterococci..cfu.100ml.
class(Bacteria)
## [1] "integer"
Month- categorical variable which indicates the month in which the water samples were collected. The variable is converted into an ordered categorical variable for analysis.
Month=data$Month
class(Month)
## [1] "character"
Ordering the Variable “Month”
Month <- ordered(Month, levels=c("January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"))
class(Month)
## [1] "ordered" "factor"
Site- a categorical variable indicating at which beach the water sample was collected.
Beach=data$Site
class(Beach)
## [1] "character"
The relationship between month of collection and bacterial levels can be depicted by a barplot, visually showing the trend of mean bacterial levels in each month. Therefore, the visual trend (if any) can be evaluated and justified.
All bacterial values over 3000 were removed as they are considered unusual and anomalous; they would impact the investigation of the data.
Bacteria1=Bacteria[Bacteria<3000]
meanjan=mean(Bacteria1[Month=="January"], na.rm = TRUE)
meanfeb=mean(Bacteria1[Month=="February"], na.rm = TRUE)
meanmar=mean(Bacteria1[Month=="March"], na.rm = TRUE)
meanapr=mean(Bacteria1[Month=="April"], na.rm = TRUE)
meanmay=mean(Bacteria1[Month=="May"], na.rm = TRUE)
meanjun=mean(Bacteria1[Month=="June"], na.rm = TRUE)
meanjul=mean(Bacteria1[Month=="July"],na.rm = TRUE)
meanaug=mean(Bacteria1[Month=="August"], na.rm = TRUE)
meansep=mean(Bacteria1[Month=="September"], na.rm = TRUE)
meanoct=mean(Bacteria1[Month=="October"], na.rm = TRUE)
meannov=mean(Bacteria1[Month=="November"], na.rm = TRUE)
meandec=mean(Bacteria1[Month=="December"], na.rm = TRUE)
meanmonth <- c(meanjan,meanfeb,meanmar,meanapr,meanmay,meanjun,meanjul,meanaug,meansep,meanoct,meannov,meandec)
barplot(meanmonth,main="Figure 1.1: Mean Enterococci Bacteria in each Month", xlab="Month", ylab="Enterococci Bacteria Observed (cfu/100ml)", names.arg = c("Jan","Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"))
distancefrommean = (meanmonth-mean(meanmonth))
data.frame(month.name,meanmonth, distancefrommean )
## month.name meanmonth distancefrommean
## 1 January 53.018349 30.3371942
## 2 February 48.449541 25.7683869
## 3 March 22.036364 -0.6447908
## 4 April 10.654545 -12.0266090
## 5 May 9.886364 -12.7947908
## 6 June 36.386364 13.7052092
## 7 July 5.236364 -17.4447908
## 8 August 27.844037 5.1628823
## 9 September 27.311927 4.6307722
## 10 October 10.100000 -12.5811544
## 11 November 7.659091 -15.0220635
## 12 December 13.590909 -9.0902453
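One caveat with the code above: `Bacteria[Bacteria<3000]` shortens the vector, while `Month` keeps its full length, so indexing one by the other may pair some measurements with the wrong month. A minimal sketch of an alignment-safe alternative follows. It uses a small invented data frame (`toy`) rather than the report's data, so the numbers are illustrative only and do not replace the table above.

```r
# Toy data standing in for beaches.csv (values are invented for illustration)
toy <- data.frame(
  Month    = c("January", "January", "February", "February", "March", "March"),
  Bacteria = c(10, 4000, 6, NA, 2, 4)
)

# Filter whole rows so Month and Bacteria stay paired
keep <- !is.na(toy$Bacteria) & toy$Bacteria < 3000
toy_clean <- toy[keep, ]

# Order months by calendar position rather than alphabetically
toy_clean$Month <- factor(toy_clean$Month, levels = month.name[1:3])

# One call replaces the separate mean() lines, one per month
mean_by_month <- tapply(toy_clean$Bacteria, toy_clean$Month, mean)
print(mean_by_month)
```

Applied to the real data, the same pattern would use all twelve entries of `month.name` as factor levels.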
Enterococci levels appear higher in the majority of the summer months, and the monthly means in both summer and winter deviate more from the overall mean. The elevated levels could be caused by increased traffic on the beaches in summer and spring, raising bacterial levels through human usage alone. The spikes in June and August could be explained by the ‘above average rainfall’ observed (Bureau of Meteorology, 2019), since bacterial levels increase after rainfall (Powers et al., 2020). Despite these observations, there is no clear trend in the data, whether grouped into seasons or followed as the year progresses. The ‘distancefrommean’ variable shows larger deviations during summer and winter, with the largest in summer, particularly in January (30.3). There is a slight oscillating pattern in this variable, possibly reflecting the more ‘unpredictable’ and ‘extreme’ nature of bacterial levels in summer and winter months, when weather tends to be more extreme.
Thus, there appears to be a slight association between summer and heightened bacterial levels, as well as more variable bacterial levels in summer and winter. However, these findings are unclear and need to be supported by further investigation.
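Note that `distancefrommean` measures how far each monthly mean sits from the average of the monthly means; it does not measure the spread of measurements within a month. A rough sketch of how within-month spread could be examined is given below, again using invented values (`toy`) rather than the report's data.

```r
# Invented measurements for two months (illustration only)
toy <- data.frame(
  Month    = factor(c("January", "January", "January", "July", "July", "July"),
                    levels = c("January", "July")),
  Bacteria = c(2, 50, 110, 4, 5, 6)
)

# Standard deviation and interquartile range within each month
sd_by_month  <- tapply(toy$Bacteria, toy$Month, sd, na.rm = TRUE)
iqr_by_month <- tapply(toy$Bacteria, toy$Month, IQR, na.rm = TRUE)

spread <- data.frame(
  month = names(sd_by_month),
  sd    = as.vector(sd_by_month),
  iqr   = as.vector(iqr_by_month)
)
print(spread)
```

A larger standard deviation or IQR for a month would support the claim that bacterial levels are more variable in that month.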
Significance level = 0.05
Hypotheses: let p be the proportion of bacterial measurements above 50 cfu/100mL.
H0: p = 0.05
H1: p ≠ 0.05
For the Z test, two assumptions are made:
1. The observations are independent of each other: a measurement of bacterial levels on one occasion does not affect the measurement on a different occasion.
2. The sample size is large enough for the Central Limit Theorem to apply; this is supported by the sample size of 1275.
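The second assumption can be checked numerically. The sketch below applies a common rule of thumb (an assumption here, not something stated in the report): the normal approximation is usually considered reasonable when both the expected number of "successes" and "failures" under the null hypothesis are at least 10.

```r
n  <- 1275   # sample size used in the report
p0 <- 0.05   # proportion under the null hypothesis

# Expected counts above and at-or-below 50 cfu/100mL if H0 is true
expected_above <- n * p0
expected_below <- n * (1 - p0)
print(c(expected_above, expected_below))

# Rule-of-thumb threshold of 10 (assumed, other texts use 5)
large_enough <- min(expected_above, expected_below) >= 10
print(large_enough)
```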
Test Statistic:
Assuming p = 0.05, the box model has:
mean=0.05
sd = (1-0) * sqrt(0.05 * 0.95)
c(mean,sd)
## [1] 0.0500000 0.2179449
Sum of sample
n = 1275
EV = mean * n
SE = sd * sqrt(n)
c(EV, SE)
## [1] 63.750000 7.782191
Calculating Observed Value and Test Statistic
OV= length(Bacteria1[Bacteria1>50])
test.stat = (OV - EV)/SE
test.stat
## [1] 2.602095
The observed value (84) is 2.6 standard errors above the expected value (assuming p=0.05).
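One point worth checking in the count above: if `Bacteria1` still contains the NA values mentioned earlier, `length(Bacteria1[Bacteria1>50])` counts them as well, because a comparison with NA yields NA and NA indices are kept when subsetting. The sketch below shows the difference on invented values (`x`); whether it changes the observed value of 84 depends on the actual data.

```r
# Invented measurements, including two missing values
x <- c(12, 75, NA, 60, 3, NA)

# Subsetting keeps the NA positions, so they are counted too
count_with_na <- length(x[x > 50])

# Summing the logical vector with na.rm = TRUE counts only confirmed exceedances
count_clean <- sum(x > 50, na.rm = TRUE)

print(c(count_with_na, count_clean))
```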
P-value:
2*pnorm((test.stat), lower.tail=FALSE)
## [1] 0.009265621
Since the p-value (0.009) is lower than the significance level (0.05), we reject the null hypothesis and conclude that the data provides strong evidence that the proportion of bacterial measurements above 50 cfu/100mL differs from 0.05.
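As a cross-check on the hand calculation, the same test can be run with base R's `prop.test()`. This is a sketch using the observed value (84) and sample size (1275) reported above; with the continuity correction switched off, the chi-squared statistic it returns is the square of the Z statistic, so the two approaches should agree.

```r
OV <- 84     # observed count above 50 cfu/100mL, as reported
n  <- 1275   # sample size, as reported

# One-sample test of a proportion against p = 0.05, without continuity correction
check <- prop.test(OV, n, p = 0.05, alternative = "two.sided", correct = FALSE)

# The square root of the chi-squared statistic recovers |Z|
z_from_prop_test <- sqrt(unname(check$statistic))
print(c(z = z_from_prop_test, p.value = check$p.value))
```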
By depicting the spread of bacterial levels recorded across the top 5 beaches, different measures can be used to determine the beach with the highest bacterial levels.
The ‘top 5’ beaches were identified from Katz (2014) and isolated from the ‘Beach’ variable. Bacterial values of 50 or above were omitted as they would distort the visual presentation of the data in a boxplot. They are also considered anomalous when considering ‘average’ bacterial levels.
Beach1<-Beach[Beach %in% c("North Steyne Beach","Dee Why Beach", "Shelly Beach (Manly)", "Freshwater Beach", "Avalon Beach")]
Bacteria2=Bacteria[Bacteria<50]
Bacteria2=Bacteria2[Beach %in% c("North Steyne Beach","Dee Why Beach", "Shelly Beach (Manly)", "Freshwater Beach", "Avalon Beach")]
Beach2<-as.factor(Beach1)
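As with the monthly analysis, filtering `Bacteria` on its own shortens it relative to `Beach`, so the two vectors may no longer line up when one is indexed by the other. A minimal sketch of an aligned filter is shown below; the data frame `toy` and its values are invented for illustration, and "Palm Beach" is included only as an example of a site outside the top 5.

```r
# Invented rows standing in for the full dataset
toy <- data.frame(
  Site     = c("Avalon Beach", "Dee Why Beach", "Palm Beach",
               "Dee Why Beach", "Avalon Beach"),
  Bacteria = c(1, 120, 3, 7, NA)
)

top5 <- c("North Steyne Beach", "Dee Why Beach", "Shelly Beach (Manly)",
          "Freshwater Beach", "Avalon Beach")

# Apply all conditions to the rows at once so Site and Bacteria stay paired
keep <- toy$Site %in% top5 & !is.na(toy$Bacteria) & toy$Bacteria < 50
toy_top5 <- toy[keep, ]
toy_top5$Site <- factor(toy_top5$Site)

print(toy_top5)
```

The filtered data frame can then be passed to a formula-based call such as `boxplot(Bacteria ~ Site, data = toy_top5)`.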
A boxplot was used to present this data as it clearly shows the distribution of bacterial levels at all 5 beaches, allowing for easy comparison and evaluation. A horizontal layout was used for ease of viewing and comparison.
par(mar = c(10, 10, 1, 1))
boxplot(Bacteria2~Beach2, horizontal=TRUE, las=2, xlab="Bacteria", ylab="" , main="Figure 2.1: Bacterial Levels on the 'Top 5' Beaches")
summary(Bacteria2[Beach2=="Shelly Beach (Manly)"])
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.000 2.000 6.316 11.000 28.000 1
summary(Bacteria2[Beach2=="North Steyne Beach"])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.500 2.966 2.000 34.000
summary(Bacteria2[Beach2=="Freshwater Beach"])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 1.000 4.017 4.000 36.000
summary(Bacteria2[Beach2=="Dee Why Beach"])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 2.500 6.621 6.750 46.000
summary(Bacteria2[Beach2=="Avalon Beach"])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 0.00 1.81 1.00 33.00
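The five separate `summary()` calls can also be collapsed into one table, which makes the beaches easier to compare side by side. The following is a sketch on invented values (`toy`), not the report's data; the choice of statistics (median, mean, third quartile, maximum) mirrors those discussed in the text.

```r
# Invented measurements for two beaches (illustration only)
toy <- data.frame(
  Site     = factor(c("Avalon Beach", "Avalon Beach", "Avalon Beach",
                      "Dee Why Beach", "Dee Why Beach", "Dee Why Beach")),
  Bacteria = c(0, 1, 2, 2, 4, 9)
)

# Split measurements by beach, compute the same statistics for each group,
# and transpose so that each beach is one row
stats_by_beach <- t(sapply(split(toy$Bacteria, toy$Site), function(v) {
  c(median = median(v, na.rm = TRUE),
    mean   = mean(v, na.rm = TRUE),
    q3     = unname(quantile(v, 0.75, na.rm = TRUE)),
    max    = max(v, na.rm = TRUE))
}))

print(stats_by_beach)
```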
Median bacterial levels appear higher at both Dee Why and Shelly Beach, with the latter having a larger spread, as indicated visually in Figure 2.1. If the mean or maximum is considered, Dee Why Beach has the highest level of bacteria in both cases. However, Shelly Beach has a similar mean and a higher third quartile (11.0 against 6.75), indicating a wider spread that places its general distribution of bacterial levels above that of Dee Why Beach. Therefore, depending on the measure used, either Dee Why or Shelly Beach has the highest bacterial levels among the top 5 beaches.
Shelly Beach’s location near Sydney Harbour puts it at risk of faecal contamination through stormwater sources and pollution (Kay, 2016). Bacterial levels also appear to decrease as the beaches lie further from the harbour, which could be a possible cause of the differences in bacterial levels, together with each beach’s location relative to drains and contaminants.
Thus, within the top 5 beaches, Dee Why and Shelly Beach seem to have the highest bacterial levels, considering mean, spread and maximum values. There also seems to be a correlation between geographical location and bacterial levels, which could be further investigated in future research.
Bureau of Meteorology. (2022). Greater Sydney in 2018: warm and generally drier than normal. Bureau of Meteorology. Retrieved 20 May 2022, from http://www.bom.gov.au/climate/current/annual/nsw/archive/2018.sydney.shtml.
Katz, I. (2014). Sydneys Northern Beaches - The Top 10 Beaches. Weekend Notes. Retrieved 20 May 2022, from https://www.weekendnotes.com/beaches-sydneys-northern-beaches/.
Kay, B. (2016). State of the Beaches report suggests swimmers could get sick at Shelly Beach. Daily Telegraph. Retrieved 20 May 2022, from https://www.dailytelegraph.com.au/newslocal/manly-daily/state-of-the-beaches-report-suggests-swimmers-could-get-sick-at-shelly-beach/news-story/dd75fb10520296d19ac44868818bc949.
NSW Government. (2019). State of the beaches 2018-2019 [pdf] (1st ed.). Department of Planning, Industry and Environment. Retrieved 20 May 2022, from https://www.environment.nsw.gov.au/-/media/OEH/Corporate-Site/Documents/Water/Beaches/state-of-beaches-2018-2019-sydney-190313.pdf?la=en&hash=B9790EC76980BED9F954E3ABDBE3AB08F801E8AD.
Powers, N., Wallgren, H., Marbach, S., & Turner, J. (2020). Relationship between Rainfall, Fecal Pollution, Antimicrobial Resistance, and Microbial Diversity in an Urbanized Subtropical Bay. Applied and Environmental Microbiology, 86(19). https://doi.org/10.1128/aem.01229-20