Introduction

The Yarra River catchment covers a large area, ranging from urban to rural land. The river is used for various recreational activities, which rely on good water quality.
Yarra Watch is a program providing long-term monitoring and information on the river's water quality. Yarra Watch measures water quality at various sites along the river.
- The most upstream site measured is Launching Place (LP), a rural area.
- The most downstream site measured is Kew, a highly urbanised area.
E.coli is a bacterium found in fecal waste. It is the indicator used to measure recreational water quality and is reported as organisms per 100 mL (E.coli/100 mL).
For context, water with E.coli/100 mL > 550 is unsuitable for all water-based recreation (swimming, boating, kayaking, etc.), according to the EPA.
Extremely high E.coli readings have historically been recorded after large rainfall events, with water quality affected for up to 5 days afterwards.
This study aims to compare mean E.coli levels between Launching Place and Kew. These sites differ greatly in land use and population density.
Kew has a higher population density and more stormwater runoff.
LP has fewer people but more agricultural runoff.
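
To make the 550 E.coli/100 mL threshold concrete, here is a minimal sketch of flagging readings that exceed it. The ten readings are the first values of the combined data set as printed by str() later in this report; the names threshold, readings and unsuitable are illustrative only.

```r
# EPA threshold quoted above: readings over 550 E.coli/100 mL are unsuitable
# for water-based recreation
threshold <- 550

# First ten E.coli values of the combined data set (from the str() output)
readings <- c(4600, 74, 72, 130, 320, 910, 41, 180, 140, 63)

# Flag each reading and compute the share of samples over the threshold
unsuitable <- readings > threshold
share_unsuitable <- mean(unsuitable)

print(data.frame(readings, unsuitable))
print(share_unsuitable)
```

The same comparison could be applied per site to the full data, which would describe how often each site is unsuitable rather than how its mean behaves.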

Problem Statement

My research question asks whether there is a statistically significant difference in mean E.coli levels between Kew and Launching Place. The research hypothesis is that there will be a significant difference in means. To test this statistically I will:

Generate descriptive statistics and visualizations
Identify outliers, and decide what course of action to take
Determine if transformations are necessary
Check the assumptions of the two-sample t-test
Run the t-test, which generates a p-value used to determine whether there is a statistically significant difference in means between Kew and Launching Place
Use that p-value to reject or fail to reject the null hypothesis ($H_0: \mu_1 - \mu_2 = 0$)

Data

This data contains observations from 2013-2022 measuring the E.coli levels across various testing sites on the Yarra River.
The data was collected through systematic sampling, with predetermined locations and times.
It is available under a Creative Commons license here: https://discover.data.vic.gov.au/dataset/yarra-watch-e-coli-data
read_excel can only load one sheet at a time, so two separate data loads were undertaken to load the Kew and Launching Place data sets.
I used the col_types = argument within read_excel for two reasons:
1. The "Sample date/time" column was being read in as integers (4443322..) when using read_excel's default "guess" behavior, so I specified it as "date".
2. The last three columns of both spreadsheets contained variables with very few values and were not needed, so I excluded them with "skip".

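library(readxl)

# Column types: Site ID (guess), Site name (text), Sample date/time (date),
# E. coli value (numeric); the last three columns are skipped
site_col_types <- c("guess", "text", "date", "numeric", "skip", "skip", "skip")

kew_raw <- read_excel("Yarra_Watch_E.coli_data.xlsx", sheet = 1,
                      col_types = site_col_types)

launch.p_raw <- read_excel("Yarra_Watch_E.coli_data.xlsx", sheet = 4,
                           col_types = site_col_types)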

Data Cont.

The two data sets are merged with bind_rows, as they have exactly the same variables.
The date is separated into two new columns, "Year" and "Month". This was not necessary, but was done in case I wanted to investigate changes over time at a later stage.
The remaining day and time values are dropped completely, as they are irrelevant to the analysis.

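library(dplyr)
library(tidyr)
# Keep year and month only; extra = "drop" discards the day/time pieces quietly
ecoli_df <- bind_rows(kew_raw, launch.p_raw) %>%
  separate(`Sample date/time`, into = c("Year", "Month"),
           sep = "-", extra = "drop")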
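
As an alternative to splitting the date as text, the year and month can be pulled straight from a date-time value with base R's format(). This is only a sketch: the two timestamps below are hypothetical stand-ins for the "Sample date/time" column, and the names sample_time and dates_df are illustrative.

```r
# Hypothetical sample timestamps standing in for the `Sample date/time` column
sample_time <- as.POSIXct(c("2022-03-15 10:30:00", "2021-12-07 09:05:00"),
                          tz = "UTC")

# Extract year and month as zero-padded character strings
dates_df <- data.frame(
  Year  = format(sample_time, "%Y"),
  Month = format(sample_time, "%m"),
  stringsAsFactors = FALSE
)
print(dates_df)
```

This gives the same "2022" / "03" style of values that the factor conversions later in this section expect.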

Data Cont.

Variables remaining:

Site name - the site on the Yarra River where E.coli measurements were taken; a nominal factor variable with two levels, one for each of the two sites
Site ID - factor variable with two levels, the ID numbers corresponding to the two site names
Year - year of observation; factor with 10 levels, one for each year (2013-2022)
Month - month of observation; factor with 5 levels (observations were only taken in 5 months of the year)
E.coli value (orgs/100 mL) - the quantity being measured; a numeric variable giving the number of E.coli organisms per 100 mL of river water tested

unique(ecoli_df$`Site ID`)

## [1]   4940 291600

ecoli_df$`Site ID` <- ecoli_df$`Site ID` %>% 
  factor(levels = c("4940", "291600"))

#check values for year
unique(ecoli_df$Year)

##  [1] "2022" "2021" NA     "2020" "2019" "2018" "2017" "2016" "2015" "2014"
## [11] "2013"

#change year to factor 
ecoli_df$Year <- ecoli_df$Year %>% 
  factor(levels = c("2013", "2014", "2015",
                    "2016", "2017", "2018",
                    "2019", "2020", "2021", "2022"))

#check values for site name
unique(ecoli_df$`Site name`)

#change site name to factor
ecoli_df$`Site name` <- ecoli_df$`Site name` %>% 
  factor(levels = c("Kew", "Launching Place"))

unique(ecoli_df$Month)

## [1] "03" "02" "01" "12" NA   "04"

#change month to factor - reassign labels for levels
ecoli_df$Month <-  ecoli_df$Month %>%
  factor(levels = c("01","02","03","04","12"),
         labels = c("Jan", "Feb", "March","April", "Dec"))

#check all is done correctly
str(ecoli_df)

## tibble [272 × 5] (S3: tbl_df/tbl/data.frame)
##  $ Site ID                    : Factor w/ 2 levels "4940","291600": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Site name                  : Factor w/ 2 levels "Kew","Launching Place": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Year                       : Factor w/ 10 levels "2013","2014",..: 10 10 10 10 10 10 10 10 10 9 ...
##  $ Month                      : Factor w/ 5 levels "Jan","Feb","March",..: 3 2 2 2 2 1 1 1 1 5 ...
##  $ E. coli value (orgs/100 mL): num [1:272] 4600 74 72 130 320 910 41 180 140 63 ...
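
The factor conversions above rely on how factor() treats values that are missing or not listed in levels. A minimal sketch using the month codes printed by unique() earlier (the name month_fct is illustrative):

```r
# Month codes as they appear after separate(), including a missing value
month_raw <- c("03", "02", "01", "12", NA, "04")

month_fct <- factor(month_raw,
                    levels = c("01", "02", "03", "04", "12"),
                    labels = c("Jan", "Feb", "March", "April", "Dec"))

# NA stays NA, and any code not listed in `levels` would also become NA
print(month_fct)
print(table(month_fct, useNA = "ifany"))
```

Checking unique() before each conversion, as done above, guards against a value silently turning into NA because it was left out of levels.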

Descriptive Statistics and Visualisation

Both data sets have some high outliers, especially the Kew data set (>20000), as illustrated by the side-by-side box plot.
The sheer scale of these large values in Kew makes comparing the two groups' raw observations via a box plot futile.
Separate histograms for each group give another useful visualisation.
Both data sets are heavily skewed to the right by these extremely large outliers, as can be seen in the histograms.

plot(`E. coli value (orgs/100 mL)`~ `Site name`, data=ecoli_df, 
     main = "Boxplot of E.coli levels (empirical data) 
     at different sites on the Yarra River (2013-2022)")

#Subset for histograms
kew <- ecoli_df %>% filter(`Site name` == "Kew")
launching_place <- ecoli_df %>% filter(`Site name` == "Launching Place")
hist(kew$`E. coli value (orgs/100 mL)`, xlab = "E. coli (orgs/100 mL)",
     main = "Histogram of E.coli levels in Kew (2013-2022)")

hist(launching_place$`E. coli value (orgs/100 mL)`,
     xlab = "E. coli (orgs/100 mL)", 
     main = "Histogram of E.coli levels in Launching Place (2013-2022)")
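
Because a few extreme values flatten the box plot on the raw scale, one option for display purposes only is to draw the y-axis on a log scale. The sketch below uses simulated right-skewed values, not the Yarra Watch data; sim_df and the lognormal parameters are assumptions chosen purely for illustration.

```r
set.seed(1)

# Simulated right-skewed readings; these are NOT the Yarra Watch data
sim_df <- data.frame(
  site  = rep(c("Kew", "Launching Place"), each = 100),
  value = c(rlnorm(100, meanlog = 5.2, sdlog = 1.3),
            rlnorm(100, meanlog = 5.7, sdlog = 0.6))
)

# log = "y" draws the y-axis on a log scale; the data themselves are unchanged
boxplot(value ~ site, data = sim_df, log = "y",
        xlab = "Site", ylab = "E. coli (orgs/100 mL), log scale",
        main = "Simulated E.coli levels by site (log-scaled axis)")
```

This changes only how the plot is drawn; it does not transform the values used in any later test.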

Descriptive Statistics Cont.

Although these outliers may seem implausible, previous Yarra Watch reports show E.coli levels >24000, which can be explained by large rainfall events. Consequently, they will not be removed for Kew or LP.
Kew has a much higher maximum value (24000) than LP (2600).
This results in a higher mean for Kew (660) than for LP (403).
This is unsurprising, as the mean is highly sensitive to outliers and skewed data; in such cases it is not the best measure of central tendency.
The median is more robust to these distribution types. This is evident when comparing the two groups' medians (Kew = 170, LP = 310): the median is much higher at LP than at Kew, the opposite of the mean.
The extremely high variability in the Kew group is reflected in its high standard deviation (2360). Although the LP group is also skewed to the right and not normally distributed, its variability is much smaller than Kew's (LP SD = 366).
The IQR is similar for both sites (Kew = 260.5, LP = 200). This statistic is resistant to outliers, as it describes the middle 50% of the data. The IQR also gives a rough check for normality: for normally distributed data it should be approximately 1.35 standard deviations.
This is certainly not the case for Kew, and is also not the case for LP (1.35 × SD ≈ 495, not the LP IQR of 200).
Missing values will be excluded from both groups (Kew: 1 of 132; LP: 11 of 140). The remaining sample size for each group (Kew n = 131; LP n = 129) is large enough (>30) to assume the sampling distribution of the mean will approximate a normal distribution.
This assumption of normality is important when conducting a two-sample t-test, as the test compares the means of the two groups.

ecoli_df %>%   group_by(`Site name`) %>%
  summarise(Min = min(`E. coli value (orgs/100 mL)`, na.rm = TRUE),
            Q1 = quantile(`E. coli value (orgs/100 mL)`, probs = 0.25, na.rm =TRUE),
            Mean = mean(`E. coli value (orgs/100 mL)`, na.rm=TRUE),
            Median = median(`E. coli value (orgs/100 mL)`, na.rm = TRUE),
            Q3 = quantile(`E. coli value (orgs/100 mL)`, probs = 0.75, na.rm = TRUE),
            Max = max(`E. coli value (orgs/100 mL)`, na.rm = TRUE),
            SD = sd (`E. coli value (orgs/100 mL)`, na.rm = TRUE),
            IQR = IQR(`E. coli value (orgs/100 mL)`, na.rm = TRUE),
            n = n(),
            Missing = sum(is.na(`E. coli value (orgs/100 mL)`))) -> table1

knitr::kable(table1)

Site name	Min	Q1	Mean	Median	Q3	Max	SD	IQR	n	Missing
Kew	10	84.5	660.4580	170	345	24000	2359.8472	260.5	132	1
Launching Place	96	220.0	403.0465	310	420	2600	366.3122	200.0	140	11
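
The IQR-to-SD comparison used as a rough normality check can be computed directly from the summary table above. A minimal sketch: the SD and IQR values are copied from the table, and the reference ratio for a normal distribution is 2 * qnorm(0.75). The variable names are illustrative.

```r
# Summary statistics copied from the table above
sd_vals  <- c(Kew = 2359.8472, `Launching Place` = 366.3122)
iqr_vals <- c(Kew = 260.5,     `Launching Place` = 200.0)

# For a normal distribution, IQR / SD = 2 * qnorm(0.75), roughly 1.35
normal_ratio <- 2 * qnorm(0.75)

observed_ratio <- iqr_vals / sd_vals
print(round(normal_ratio, 3))
print(round(observed_ratio, 3))
```

Both observed ratios fall well below the normal reference value, which is consistent with the heavy right skew seen in the histograms.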

Transformation

These data will not undergo any transformations prior to hypothesis testing.
Each group has a large sample size (n=132, n=140 before excluding missing values), so although both groups' data are skewed to the right, the sampling distribution of each mean will be approximately normal. This concept is known as the Central Limit Theorem.
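
The Central Limit Theorem argument can be illustrated with a small simulation. This is a sketch under stated assumptions, not a check on the actual data: the lognormal population, its parameters, and the skewness() helper are all hypothetical choices made for illustration.

```r
set.seed(42)

# Sample skewness (third standardised moment); a helper defined for this sketch
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

# A strongly right-skewed stand-in population; NOT the Yarra Watch data
raw_draws <- rlnorm(10000, meanlog = 5, sdlog = 1)

# Means of 2000 repeated samples of size 130 (close to the group sizes here)
sample_means <- replicate(2000, mean(rlnorm(130, meanlog = 5, sdlog = 1)))

print(skewness(raw_draws))     # skewness of individual observations
print(skewness(sample_means))  # skewness of the sample means
```

The sample means are far less skewed than the individual draws. With heavier skew than assumed here, convergence towards normality is slower, so the approximation would be rougher at the same sample size.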

library(car)

kew$`E. coli value (orgs/100 mL)` %>% qqPlot(dist = "norm")

## [1] 69 31

launching_place$`E. coli value (orgs/100 mL)` %>% 
  qqPlot(dist ="norm")

## [1] 36 77

The above Q-Q plots are not strictly needed; however, they visualize the data distribution and compare it with what would be expected if the data followed a normal distribution. If the data followed a normal distribution, the points would fall within the blue shaded band. As we can see, this is not the case.

Hypothesis Testing

Because we are investigating whether there is a difference in mean E.coli levels between two independent sites on the Yarra River, we will use a two-sample t-test.
This t-test will be two-sided, as we are not concerned with which mean is higher but rather with whether there is any difference in the means.
Before conducting this t-test, there are assumptions that must hold true:

Normality - the raw data are not normally distributed, but both samples are large enough that the sampling distribution of each mean will be approximately normal, as per the Central Limit Theorem
Homogeneity of variance - checked with Levene's test; leveneTest centres on the median by default rather than the mean, which is more robust for highly skewed data such as these
Independence - the two populations being compared are independent of each other, as one cannot have the same E.coli bacteria in two places at once

#Homogeneity of variance
leveneTest(`E. coli value (orgs/100 mL)`~ `Site name`, data = ecoli_df)

#two sample t-test
t.test(`E. coli value (orgs/100 mL)`~ `Site name`, data = ecoli_df, var.equal=TRUE,
       alternative = "two.sided")

## 
##  Two Sample t-test
## 
## data:  E. coli value (orgs/100 mL) by Site name
## t = 1.2244, df = 258, p-value = 0.2219
## alternative hypothesis: true difference in means between group Kew and group Launching Place is not equal to 0
## 95 percent confidence interval:
##  -156.5725  671.3956
## sample estimates:
##             mean in group Kew mean in group Launching Place 
##                      660.4580                      403.0465
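
The reported t statistic, p-value and confidence interval can be reproduced by hand from the descriptive table, which is a useful check on the output above. This is a minimal sketch of the pooled-variance formula; the means and standard deviations are copied from the table, the group sizes are the table's n minus its Missing count, and the variable names are illustrative.

```r
# Summary statistics from the descriptive table, using non-missing counts
m <- c(kew = 660.4580,  lp = 403.0465)   # group means
s <- c(kew = 2359.8472, lp = 366.3122)   # group standard deviations
n <- c(kew = 132 - 1,   lp = 140 - 11)   # n minus missing values

df_pooled <- sum(n) - 2
# Pooled variance: weighted average of the two group variances
sp2 <- sum((n - 1) * s^2) / df_pooled
se  <- sqrt(sp2 * sum(1 / n))

mean_diff <- unname(m["kew"] - m["lp"])
t_stat    <- mean_diff / se
p_value   <- 2 * pt(-abs(t_stat), df = df_pooled)
ci        <- mean_diff + c(-1, 1) * qt(0.975, df_pooled) * se

print(c(t = t_stat, df = df_pooled, p = p_value))
print(ci)
```

The degrees of freedom (131 + 129 - 2 = 258) match the t.test output, confirming that the rows with missing values were dropped before testing.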

Hypothesis Testing Cont.

Interpreting the results:

LeveneTest

The p-value from this test is compared against the 0.05 significance level
The null hypothesis is that the groups have equal variance: $H_0: \sigma_1^2 = \sigma_2^2$
Because p = 0.06 > 0.05, we fail to reject the null and proceed as though the groups have equal variance
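
To show what Levene's test with median centring computes, the sketch below builds it by hand: take each reading's absolute deviation from its own group median, then run a one-way ANOVA on those deviations. The data are simulated, not the Yarra Watch data; sim_df, abs_dev and the lognormal parameters are assumptions for illustration.

```r
set.seed(7)

# Simulated skewed readings for two sites; NOT the Yarra Watch data
sim_df <- data.frame(
  site  = factor(rep(c("Kew", "Launching Place"), each = 120)),
  value = c(rlnorm(120, meanlog = 5.2, sdlog = 1.3),
            rlnorm(120, meanlog = 5.7, sdlog = 0.6))
)

# Absolute deviation of each reading from its own group's median
sim_df$abs_dev <- abs(sim_df$value -
                        ave(sim_df$value, sim_df$site, FUN = median))

# One-way ANOVA on the deviations; the F test for `site` is the Levene statistic
levene_by_hand <- anova(lm(abs_dev ~ site, data = sim_df))
print(levene_by_hand)
```

A small p-value in the site row would indicate that the typical spread around the median differs between the groups.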

Two-sample t-test

Because equal variance can be assumed, we can conduct a standard (pooled-variance) two-sample t-test

Null hypothesis: That the difference between the two population means is 0: \[H_0: \mu_1 - \mu_2 = 0\]

Alternate Hypothesis: That the difference between the two population means is not equal to 0 \[H_A: \mu_1 -\mu_2 \ne 0\]

A two-sample t-test was used to test for a significant difference in mean E.coli levels between Kew and Launching Place. E.coli levels at both Kew and Launching Place exhibited non-normality, as seen in the histograms, descriptive statistics and Q-Q plots. However, given the large sample sizes in both groups, the central limit theorem meant that the t-test could still be applied. Homogeneity of variance could be assumed, as indicated by Levene's test. The two-sample t-test assuming equal variance did not find a statistically significant difference in mean E.coli levels between Kew and Launching Place, t(df=258)=1.22, p=0.22, 95% CI for the difference in means [-156.6, 671.4]. The results of this investigation therefore indicate no statistically significant difference in mean E.coli levels between Kew and Launching Place.

Discussion

Strengths

Large sample sizes were obtained for both groups (n=132, n=140).
These large samples meant no log transformations (something normally done to right-skewed data) were required for hypothesis testing; the statistical results therefore come from the original data rather than altered data.
This means the results directly support statistical inferences about the hypotheses and the investigation.

Limitations

The extremely high outliers in the Kew data set (24000) have dragged the mean substantially to the right. Comparing means in such highly skewed data is perhaps not the best form of hypothesis testing.
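
One alternative a reader might consider for heavily skewed data is a rank-based test such as the Wilcoxon rank-sum test, which is less influenced by extreme values. It was not part of this investigation; the sketch below only shows the call, using simulated values rather than the Yarra Watch data. The names kew_sim and lp_sim and the lognormal parameters are assumptions for illustration.

```r
set.seed(123)

# Simulated skewed readings; NOT the Yarra Watch data
kew_sim <- rlnorm(130, meanlog = 5.2, sdlog = 1.3)
lp_sim  <- rlnorm(130, meanlog = 5.7, sdlog = 0.6)

# Two-sided Wilcoxon rank-sum (Mann-Whitney) test: compares the two groups
# through ranks, so a single extreme reading has limited influence
rank_test <- wilcox.test(kew_sim, lp_sim, alternative = "two.sided")
print(rank_test)
```

Note that this test addresses a different question from the t-test (a shift in distribution rather than a difference in means), so its result would not be directly interchangeable with the one reported here.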

Conclusions

This investigation looked to determine whether there was a difference in mean levels of E.coli between Kew and LP.
It found no statistically significant difference in mean E.coli levels between Kew and LP (p = 0.22 > 0.05).
Further research could investigate changes in E.coli levels over time.

Comparing the mean differences in E.coli levels on the Yarra River