Comparing the mean differences in E.coli levels on
the Yarra River
Alexander Musumeci #s3933758
28th May 2022
Introduction
- The Yarra River catchment encompasses a large area, ranging from
urban to rural land. The river itself is used for various recreational
activities, which rely on good water quality.
- Yarra Watch is a program providing long-term monitoring and
information regarding the river's water quality. Yarra Watch measures
water quality at various sites along the river.
- The most upstream site measured is Launching Place (LP), a rural
area.
- The most downstream site measured is Kew, a highly urbanised
area.
- E.coli is a bacterium found in fecal waste and is the indicator
used to measure recreational water quality. It is reported as organisms
per 100 mL (E.coli/100 mL).
- For context, E.coli/100 mL > 550 is unsuitable for all water-based
recreation (swimming, boating, kayaking etc.), according to the EPA.
- Extremely high E.coli readings have historically been recorded after
large rainfall events, with water quality being affected for up to 5
days afterwards.
- This study aims to compare mean E.coli levels between Launching
Place and Kew. These sites differ greatly in land use and
population density.
- Kew has a higher population density and more storm water
runoff.
- LP has fewer people but more agricultural runoff.
Problem Statement
My research question asks whether there is a statistically
significant difference in mean E.coli levels between Kew and Launching
Place. The research hypothesis is that there is a significant
difference in means. To test this statistically I will:
- Generate descriptive statistics and visualisations
- Identify outliers and decide what course of action to take
- Determine if transformations are necessary
- Check the assumptions of the two-sample t-test
- Run the t-test, which generates a p-value used to
determine whether there is a statistically significant difference in means
between Kew and Launching Place
- Use that p-value to reject or fail to reject the null hypothesis (\(H_0: \mu_1 - \mu_2 = 0\))
Data
- This data contains observations from 2013-2022 measuring E.coli
levels at various testing sites on the Yarra River.
- The data was collected through systematic sampling, with
predetermined locations and times.
- It is available under a Creative Commons license here: https://discover.data.vic.gov.au/dataset/yarra-watch-e-coli-data
- read_excel can only load one sheet at a time, so two separate
loads were needed for the Kew and Launching Place data sets.
- I used the col_types argument of read_excel for two reasons:
1. The “Sample date/time” column was being read in as integers (4443322..)
under read_excel's default “guess” behaviour, so I specified it as
“date”.
2. The last three columns of both spreadsheets contained variables with
very few values and were not needed, so I excluded them with
“skip”.
library(readxl)  # read_excel()
# Sheet 1 is loaded as the Kew data, sheet 4 as the Launching Place data
kew_raw <- read_excel("Yarra_Watch_E.coli_data.xlsx", sheet = 1,
                      col_types = c("guess", "text", "date", "numeric",
                                    "skip", "skip", "skip"))
launch.p_raw <- read_excel("Yarra_Watch_E.coli_data.xlsx", sheet = 4,
                           col_types = c("guess", "text", "date", "numeric",
                                         "skip", "skip", "skip"))
Data Cont.
- Merge the two data sets with bind_rows, as they have exactly the
same variables.
- Create two new columns separating the date into “Year” and “Month”.
This was not necessary, but was done in case I wanted to investigate
changes over time at a later stage.
- Remove the time values completely, as they are irrelevant to the
analysis.
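library(dplyr)  # bind_rows(), %>%
library(tidyr)  # separate()
# Keep only the year and month pieces of the timestamp; day and time are dropped
ecoli_df <- bind_rows(kew_raw, launch.p_raw) %>%
  separate(`Sample date/time`, into = c("Year", "Month"),
           sep = "-", extra = "drop")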
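An alternative to splitting the timestamp text is to format the date directly. The following is a minimal sketch in base R, assuming the column is a date-time (POSIXct) vector. The `sample_time` values and the `toy_dates` name are made up for illustration and are not taken from the data set.

```r
# Hypothetical timestamps standing in for the `Sample date/time` column
sample_time <- as.POSIXct(c("2022-03-15 09:30:00", "2021-12-07 10:05:00", NA),
                          tz = "UTC")

# format() pulls out the year and month as text; missing dates stay NA
toy_dates <- data.frame(
  Year  = format(sample_time, "%Y"),
  Month = format(sample_time, "%m"),
  stringsAsFactors = FALSE
)
print(toy_dates)
```

Either approach leaves Year and Month as character columns, so the factor conversion step is still needed afterwards.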
Data Cont.
Variables remaining:
- Site name - the site on the Yarra River where E.coli measurements
were taken; a factor (nominal) variable with two levels, one for each
of the two sites selected
- Site ID - a factor variable with two levels, the site ID numbers
corresponding to the two site names
- Year - year of observation; a factor with 10 levels, one for each year
(2013-2022)
- Month - month of observation; a factor with 5 levels (observations
were only taken in 5 months of the year)
- E.coli value (orgs/100 mL) - the quantity being measured; a numeric
variable giving the number of E.coli organisms per 100 mL of river
water tested
unique(ecoli_df$`Site ID`)
## [1] 4940 291600
ecoli_df$`Site ID` <- ecoli_df$`Site ID` %>%
factor(levels = c("4940", "291600"))
#check values for year
unique(ecoli_df$Year)
## [1] "2022" "2021" NA "2020" "2019" "2018" "2017" "2016" "2015" "2014"
## [11] "2013"
#change year to factor
ecoli_df$Year <- ecoli_df$Year %>%
factor(levels = c("2013", "2014", "2015",
"2016", "2017", "2018",
"2019", "2020", "2021", "2022"))
#check values for site name
unique(ecoli_df$`Site name`)
#change site name to factor
ecoli_df$`Site name` <- ecoli_df$`Site name` %>%
factor(levels = c("Kew", "Launching Place"))
unique(ecoli_df$Month)
## [1] "03" "02" "01" "12" NA "04"
#change month to factor - reassign labels for levels
ecoli_df$Month <- ecoli_df$Month %>%
factor(levels = c("01","02","03","04","12"),
labels = c("Jan", "Feb", "March","April", "Dec"))
#check all is done correctly
str(ecoli_df)
## tibble [272 × 5] (S3: tbl_df/tbl/data.frame)
## $ Site ID : Factor w/ 2 levels "4940","291600": 1 1 1 1 1 1 1 1 1 1 ...
## $ Site name : Factor w/ 2 levels "Kew","Launching Place": 1 1 1 1 1 1 1 1 1 1 ...
## $ Year : Factor w/ 10 levels "2013","2014",..: 10 10 10 10 10 10 10 10 10 9 ...
## $ Month : Factor w/ 5 levels "Jan","Feb","March",..: 3 2 2 2 2 1 1 1 1 5 ...
## $ E. coli value (orgs/100 mL): num [1:272] 4600 74 72 130 320 910 41 180 140 63 ...
Descriptive Statistics and Visualisation
- Both data sets have some high outliers, especially the Kew data
set (>20000), as illustrated by the side-by-side box plot.
- The sheer scale of these large values at Kew makes comparing the two
groups' raw observations via a box plot futile.
- Separate histograms for each group give another useful
visualisation.
- Both data sets are heavily right-skewed because of these extremely
large outliers, as can be seen in the histograms.
plot(`E. coli value (orgs/100 mL)`~ `Site name`, data=ecoli_df,
main = "Boxplot of E.coli levels (empirical data)
at different sites on the Yarra River (2013-2022)")

#Subset for histograms
kew <- ecoli_df %>% filter(`Site name`%in% "Kew")
launching_place <- ecoli_df %>% filter(`Site name`%in% "Launching Place")
hist(kew$`E. coli value (orgs/100 mL)`, xlab = "E. coli (orgs/100 mL)",
main = "Histogram of E.coli levels in Kew (2013-2022)")

hist(launching_place$`E. coli value (orgs/100 mL)`,
xlab = "E. coli (orgs/100 mL)",
main = "Histogram of E.coli levels in Launching Place (2013-2022)")
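One way to keep a side-by-side box plot readable despite such extreme values is to draw the y-axis on a log scale. The following is only a sketch on simulated data: the `toy` data frame and its log-normal parameters are assumptions chosen to look right-skewed, not the Yarra Watch measurements.

```r
set.seed(2022)  # reproducible made-up data

# Hypothetical right-skewed readings; NOT the Yarra Watch measurements
toy <- data.frame(
  site  = rep(c("Kew", "Launching Place"), each = 100),
  value = c(rlnorm(100, meanlog = 5.2, sdlog = 1.4),
            rlnorm(100, meanlog = 5.7, sdlog = 0.6))
)

# log = "y" draws the y-axis on a log scale so both boxes stay readable
bp <- boxplot(value ~ site, data = toy, log = "y",
              xlab = "Site", ylab = "E. coli (orgs/100 mL), log scale",
              main = "Box plot on a log scale (simulated data)")
```

A log axis only changes how the plot is drawn. The values themselves are not transformed, so the later analysis is unaffected.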

Descriptive Statistics Cont.
- Although these outliers would seem implausible, background research
into previous Yarra Watch reports shows E.coli levels >24000, which can
be explained by large rainfall events. Consequently they will not be
removed for Kew or LP.
- Kew has a much higher maximum value (24000) than LP (2600).
- This results in a higher mean for Kew (660) than for LP
(403).
- This is unsurprising, as the mean is highly sensitive to outliers
and skewed data; in such cases it is not the best measure of central
tendency.
- The median is more robust to these distribution types.
- This is evident when comparing the two groups' medians (Kew = 170,
LP = 310).
- The median is much higher at LP than at Kew, the opposite of the
mean.
- The extremely high variability in the Kew group is reflected in its
high standard deviation (2360). Although the LP group is also
right-skewed and not normally distributed, its variability is much
smaller than Kew's (LP sd = 366).
- The IQRs of the two groups are similar (Kew = 260.5, LP = 200). This
statistic is robust to outliers, as it represents the spread of the
middle 50% of the data. The IQR gives a rough check for normality: for
normally distributed data it should be about 1.35 standard deviations.
- This is certainly not the case for Kew (sd*1.35 = 3186, not the Kew
IQR of 260.5), and is also not the case for LP (sd*1.35 = 495, not the
LP IQR of 200).
- Missing values in both groups (1 for Kew, 11 for LP) will be
excluded. The remaining sample size for each group (Kew n = 131; LP
n = 129) is large enough (>30) to assume the sampling distribution of
the mean will approximate a normal distribution.
- This assumption of normality will be important when conducting a
two-sample t-test, as it compares the means of the two groups.
# n counts every row for the site, including rows with a missing E.coli value
table1 <- ecoli_df %>% group_by(`Site name`) %>%
  summarise(Min = min(`E. coli value (orgs/100 mL)`, na.rm = TRUE),
            Q1 = quantile(`E. coli value (orgs/100 mL)`, probs = 0.25, na.rm = TRUE),
            Mean = mean(`E. coli value (orgs/100 mL)`, na.rm = TRUE),
            Median = median(`E. coli value (orgs/100 mL)`, na.rm = TRUE),
            Q3 = quantile(`E. coli value (orgs/100 mL)`, probs = 0.75, na.rm = TRUE),
            Max = max(`E. coli value (orgs/100 mL)`, na.rm = TRUE),
            SD = sd(`E. coli value (orgs/100 mL)`, na.rm = TRUE),
            IQR = IQR(`E. coli value (orgs/100 mL)`, na.rm = TRUE),
            n = n(),
            Missing = sum(is.na(`E. coli value (orgs/100 mL)`)))
knitr::kable(table1)
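| Site name | Min | Q1 | Mean | Median | Q3 | Max | SD | IQR | n | Missing |
|:----------------|----:|------:|---------:|-------:|----:|------:|----------:|------:|----:|--------:|
| Kew | 10 | 84.5 | 660.4580 | 170 | 345 | 24000 | 2359.8472 | 260.5 | 132 | 1 |
| Launching Place | 96 | 220.0 | 403.0465 | 310 | 420 | 2600 | 366.3122 | 200.0 | 140 | 11 |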
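The IQR-based check for normality can be made concrete with the summary values reported in the table. This is a small sketch: the data frame name `summary_stats` is my own, and the benchmark ratio is computed from the standard normal quantiles rather than assumed.

```r
# For a normal distribution, IQR / SD equals this constant (about 1.349)
normal_ratio <- qnorm(0.75) - qnorm(0.25)

# SD and IQR copied from the descriptive statistics table
summary_stats <- data.frame(
  site = c("Kew", "Launching Place"),
  SD   = c(2359.8472, 366.3122),
  IQR  = c(260.5, 200.0)
)

# Ratios well below the normal benchmark point to heavy right tails
summary_stats$ratio <- summary_stats$IQR / summary_stats$SD
print(normal_ratio)
print(summary_stats)
```

Both ratios fall well short of the benchmark, which is consistent with the outlier-driven standard deviations described in the bullets.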
Transformation
- These data will not undergo any transformations prior to hypothesis
testing.
- Each group has a large sample size (n=132, n=140), so although both
groups' data are right-skewed, the sampling distribution of the mean
will approximate a normal distribution. This concept is known as the
Central Limit Theorem.
kew$`E. coli value (orgs/100 mL)` %>% qqPlot(dist = "norm")

## [1] 69 31
launching_place$`E. coli value (orgs/100 mL)` %>%
qqPlot(dist ="norm")

## [1] 36 77
The above Q-Q plots are not strictly needed; however, they visualise the
data distribution and compare it with what would be expected if the data
followed a normal distribution. If the data were normally distributed, the
points would fall within the blue band. As we can see, this is not the
case.
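The Central Limit Theorem argument can be checked informally by simulation. The sketch below draws repeated samples from a made-up right-skewed (log-normal) population and compares the skewness of the raw values with the skewness of the sample means. The population parameters, the number of repeats and the helper `skewness()` are all assumptions for illustration; none of this uses the Yarra Watch data.

```r
set.seed(1)

# Sample skewness: close to 0 for a symmetric distribution
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

n_per_sample <- 130   # close to the group sizes in this study
n_repeats    <- 2000  # number of simulated samples

# Hypothetical right-skewed population (log-normal), not the real data
draws <- matrix(rlnorm(n_per_sample * n_repeats, meanlog = 5, sdlog = 1.2),
                nrow = n_repeats)
sample_means <- rowMeans(draws)  # one mean per simulated sample

skew_raw   <- skewness(as.vector(draws))
skew_means <- skewness(sample_means)
print(c(raw = skew_raw, means = skew_means))

hist(sample_means, main = "Simulated sample means (n = 130)",
     xlab = "Sample mean")
```

If the skewness of the sample means is still clearly above zero, the normal approximation is only rough at this sample size for data this skewed, which is worth bearing in mind when reading the t-test results.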
Hypothesis Testing
- Because we are investigating whether there is a difference in mean
E.coli levels between two independent sites on the Yarra River, we will
use a two-sample t-test.
- This t-test will be two-sided, as we are not concerned with which mean
is higher, only with whether there is any difference in the means.
- Before conducting this t-test, there are assumptions that must hold
true:
- Normality: the raw data are not normally distributed, but both
samples are large enough that the sampling distributions of the
respective means will approximate a normal distribution, as per the
Central Limit Theorem.
- Homogeneity of variance: checked with Levene's test. By default,
leveneTest uses the median rather than the mean as the centre of each
group, which suits this highly skewed data.
- Independence: the two populations being compared are independent of
each other, as one cannot have the same E.coli bacteria in two places
at once.
#Homogeneity of variance
leveneTest(`E. coli value (orgs/100 mL)`~ `Site name`, data = ecoli_df)
#two sample t-test
t.test(`E. coli value (orgs/100 mL)`~ `Site name`, data = ecoli_df, var.equal=TRUE,
alternative = "two.sided")
##
## Two Sample t-test
##
## data: E. coli value (orgs/100 mL) by Site name
## t = 1.2244, df = 258, p-value = 0.2219
## alternative hypothesis: true difference in means between group Kew and group Launching Place is not equal to 0
## 95 percent confidence interval:
## -156.5725 671.3956
## sample estimates:
## mean in group Kew mean in group Launching Place
## 660.4580 403.0465
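As a check on the output above, the pooled-variance t statistic can be rebuilt from the summary statistics alone. This sketch uses the means and standard deviations from the descriptive statistics table, with group sizes taken as n minus the missing readings (131 and 129). The variable names are my own.

```r
# Values from the descriptive statistics table (n excludes missing readings)
mean_kew <- 660.4580; sd_kew <- 2359.8472; n_kew <- 132 - 1
mean_lp  <- 403.0465; sd_lp  <- 366.3122;  n_lp  <- 140 - 11

deg_free   <- n_kew + n_lp - 2
pooled_var <- ((n_kew - 1) * sd_kew^2 + (n_lp - 1) * sd_lp^2) / deg_free
se_diff    <- sqrt(pooled_var * (1 / n_kew + 1 / n_lp))

t_stat  <- (mean_kew - mean_lp) / se_diff
p_value <- 2 * pt(-abs(t_stat), deg_free)            # two-sided p-value
ci      <- (mean_kew - mean_lp) +
  c(-1, 1) * qt(0.975, deg_free) * se_diff           # 95% confidence interval

print(round(c(t = t_stat, df = deg_free, p = p_value), 4))
print(round(ci, 1))
```

The results should agree with the t.test output to rounding, which also confirms that the test was run on 131 and 129 non-missing readings rather than the 132 and 140 rows in the table.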
Hypothesis Testing Cont.
Interpreting the results:
Levene’s Test
- This test reports a p-value, which is compared against the 0.05
significance level
- The null hypothesis of the test is equal variances: \(H_0: \sigma_1^2 = \sigma_2^2\)
- Because p = 0.06 > 0.05, we fail to reject the null hypothesis and
treat the groups as having equal variance
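To show what a median-centred Levene's test is doing, the sketch below computes it by hand in base R: take each reading's absolute deviation from its own group median, then run a one-way ANOVA on those deviations. As far as I understand, this matches the default behaviour of leveneTest, but treat that as an assumption. The `toy` data are simulated and are not the Yarra Watch readings.

```r
set.seed(42)

# Hypothetical skewed readings for two sites; NOT the Yarra Watch data
toy <- data.frame(
  site  = factor(rep(c("A", "B"), each = 60)),
  value = c(rlnorm(60, meanlog = 5, sdlog = 1.0),
            rlnorm(60, meanlog = 5, sdlog = 0.5))
)

# Absolute deviation of each reading from its own group's median
toy$abs_dev <- abs(toy$value - ave(toy$value, toy$site, FUN = median))

# One-way ANOVA on those deviations: a small p-value suggests unequal spread
levene_by_hand <- anova(lm(abs_dev ~ site, data = toy))
print(levene_by_hand)
```

Centring on the median rather than the mean keeps a handful of extreme readings from dominating the deviations, which is why it suits skewed data.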
Two-sample t-test
- Because equal variance can be assumed, we can conduct a standard
(pooled-variance) two-sample t-test
Null hypothesis: the difference between the two population means
is 0: \[H_0: \mu_1 - \mu_2 = 0\]
Alternative hypothesis: the difference between the two population
means is not equal to 0: \[H_A: \mu_1 - \mu_2 \ne 0\]
A two-sample t-test was used to test for a significant difference in
the mean E.coli levels at Kew and Launching Place. E.coli levels at both
Kew and Launching Place exhibited non-normality, as seen in the histograms,
descriptive statistics and Q-Q plots. However, given the large sample
sizes in both groups, the Central Limit Theorem justified applying the
t-test. Homogeneity of variance could be assumed, as
indicated by Levene’s test. The two-sample
t-test assuming equal variance did not find a statistically significant
difference in mean E.coli levels between Kew and Launching Place,
t(df=258)=1.22, p=0.22, 95% CI for the difference in means [-156.6,
671.4]. The results of this investigation therefore provide no evidence
of a significant difference in mean E.coli levels between Kew and
Launching Place.
Discussion
Strengths
- Large sample sizes were obtained for both groups (n=132, n=140).
- These large samples meant no log transformations were required
(something normally done to right-skewed data) for hypothesis testing.
The statistical results therefore come from the original data and not
from altered data.
- This means the results directly allow statistical inferences to
be made regarding the hypotheses and the investigation.
Limitations
- The extremely high outliers in the Kew data set (24000) have pulled
the mean upwards substantially. Comparing means in such
highly skewed data is perhaps not the best form of hypothesis testing.
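For readers curious about the alternatives hinted at in the limitation above, here is a sketch of two options on simulated data: a t-test on log10 values and a Wilcoxon rank-sum test. Neither was part of this investigation, and the simulated vectors `kew_sim` and `lp_sim` are assumptions, not the Yarra Watch data. Note also that t.test defaults to the Welch (unequal variance) version, and that a test on log values compares geometric rather than arithmetic means, so it answers a different question from the one posed here.

```r
set.seed(7)

# Hypothetical right-skewed readings; NOT the Yarra Watch data
kew_sim <- rlnorm(130, meanlog = 5.2, sdlog = 1.4)
lp_sim  <- rlnorm(130, meanlog = 5.7, sdlog = 0.6)

# Welch t-test on log10 values: compares the groups on a multiplicative scale
log_test <- t.test(log10(kew_sim), log10(lp_sim))

# Wilcoxon rank-sum test: based on ranks, no normality assumption
rank_test <- wilcox.test(kew_sim, lp_sim)

print(c(log_t_p = log_test$p.value, wilcoxon_p = rank_test$p.value))
```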
Conclusions
This investigation looked to determine whether there was a
difference in mean levels of E.coli between Kew and LP.
This investigation found no statistically significant difference
in mean E.coli levels between Kew and LP (p = 0.22 > 0.05).
Further research could investigate changes in
E.coli levels over time.
References
Yarra Watch report 2009-11. Vgls.vic.gov.au. 2022. [online] Available
at: https://www.vgls.vic.gov.au/client/en_AU/search/asset/1159478/0
[Accessed 28 May 2022].
Feng, C., Wang, H., Lu, N., Chen, T., He, H., Lu, Y. and Tu, X. (2014).
Log-transformation and its implications for data analysis. Shanghai
Archives of Psychiatry, [online] 26(2), pp.105–109. doi:10.3969/j.issn.1002-0829.2014.02.009.
Available at: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4120293/