Comparing the mean differences in E.coli levels on the Yarra River

Alexander Musumeci #s3933758

28th May 2022

Introduction

Problem Statement

My research question asks whether there is a statistically significant difference in mean E.coli levels between Kew and Launching Place. The research hypothesis is that there will be a significant difference in means. To test this statistically I will:

Data

  1. The last three columns of both spreadsheets were excluded, as they contained variables with very few values and were not needed. This is specified with “skip” in col_types.
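library(readxl)  # read_excel()
library(dplyr)   # bind_rows(), filter(), group_by(), summarise(), %>%
library(tidyr)   # separate()
library(car)     # qqPlot(), leveneTest()

# Kew is read from sheet 1; "skip" drops the last three columns
kew_raw <- read_excel("Yarra_Watch_E.coli_data.xlsx", sheet = 1,
                      col_types = c("guess", "text", "date", "numeric",
                                    "skip", "skip", "skip"))

# Launching Place is read from sheet 4 with the same column layout
launch.p_raw <- read_excel("Yarra_Watch_E.coli_data.xlsx", sheet = 4,
                           col_types = c("guess", "text", "date", "numeric",
                                         "skip", "skip", "skip"))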

Data Cont.

The two data sets are merged with bind_rows, as they have exactly the same variables. Two new columns are then created by separating the date into “Year” and “Month”. This was not necessary, but was done in case I wanted to investigate changes over time at a later stage.

ecoli_df <- bind_rows(kew_raw, launch.p_raw) %>% 
  # keep only the first two "-"-separated pieces (year, month) and drop the rest
  separate(`Sample date/time`, into = c("Year", "Month"),
           sep = "-", extra = "drop")
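As a hedged aside, the year and month can also be pulled directly from a date-time value with base R's format(), which avoids splitting the text on "-". The sketch below uses a made-up `sample_time` vector (an assumption for illustration, not the Yarra Watch data).

```r
# Hypothetical date-times standing in for the `Sample date/time` column
sample_time <- as.POSIXct(c("2022-03-15 10:30:00", "2021-12-01 09:00:00", NA),
                          tz = "UTC")

# format() returns character strings; missing dates stay NA
date_parts <- data.frame(
  Year  = format(sample_time, "%Y"),
  Month = format(sample_time, "%m"),
  stringsAsFactors = FALSE
)
print(date_parts)
```

Either approach leaves Year and Month as character columns, so the conversion to factors that follows is still needed.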

Data Cont.

Variables remaining:

unique(ecoli_df$`Site ID`)
## [1]   4940 291600
ecoli_df$`Site ID` <- ecoli_df$`Site ID` %>% 
  factor(levels = c("4940", "291600"))

#check values for year
unique(ecoli_df$Year)
##  [1] "2022" "2021" NA     "2020" "2019" "2018" "2017" "2016" "2015" "2014"
## [11] "2013"
#change year to factor 
ecoli_df$Year <- ecoli_df$Year %>% 
  factor(levels = c("2013", "2014", "2015",
                    "2016", "2017", "2018",
                    "2019", "2020", "2021", "2022"))

#check values for site name
unique(ecoli_df$`Site name`)
#change site name to factor
ecoli_df$`Site name` <- ecoli_df$`Site name` %>% 
  factor(levels = c("Kew", "Launching Place"))

unique(ecoli_df$Month)
## [1] "03" "02" "01" "12" NA   "04"
#change month to factor - reassign labels for levels
ecoli_df$Month <-  ecoli_df$Month %>%
  factor(levels = c("01","02","03","04","12"),
         labels = c("Jan", "Feb", "March","April", "Dec"))

#check all is done correctly
str(ecoli_df)
## tibble [272 × 5] (S3: tbl_df/tbl/data.frame)
##  $ Site ID                    : Factor w/ 2 levels "4940","291600": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Site name                  : Factor w/ 2 levels "Kew","Launching Place": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Year                       : Factor w/ 10 levels "2013","2014",..: 10 10 10 10 10 10 10 10 10 9 ...
##  $ Month                      : Factor w/ 5 levels "Jan","Feb","March",..: 3 2 2 2 2 1 1 1 1 5 ...
##  $ E. coli value (orgs/100 mL): num [1:272] 4600 74 72 130 320 910 41 180 140 63 ...
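One point worth keeping in mind with the factor conversions above: any value that is not listed in `levels` is silently turned into NA. A minimal sketch with made-up month codes (the `month_codes` vector is hypothetical, chosen so that "05" is missing from the levels):

```r
# Hypothetical month codes; "05" is deliberately absent from `levels`
month_codes <- c("03", "02", "05", NA, "12")

month_factor <- factor(month_codes,
                       levels = c("01", "02", "03", "04", "12"),
                       labels = c("Jan", "Feb", "March", "April", "Dec"))

# Values outside `levels` become NA without any warning
print(month_factor)
print(sum(is.na(month_factor)))
```

This is why checking unique() before each conversion, as done above, is a sensible habit: it confirms that every observed value has a matching level.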

Descriptive Statistics and Visualisation

plot(`E. coli value (orgs/100 mL)`~ `Site name`, data=ecoli_df, 
     main = "Boxplot of E.coli levels (empirical data) 
     at different sites on the Yarra River (2013-2022)")

#Subset for histograms
kew <- ecoli_df %>% filter(`Site name`%in% "Kew")
launching_place <- ecoli_df %>% filter(`Site name`%in% "Launching Place")
hist(kew$`E. coli value (orgs/100 mL)`, xlab = "E. coli (orgs/100 mL)",
     main = "Histogram of E.coli levels in Kew (2013-2022)")

hist(launching_place$`E. coli value (orgs/100 mL)`,
     xlab = "E. coli (orgs/100 mL)", 
     main = "Histogram of E.coli levels in Launching Place (2013-2022)")

Descriptive Statistics Cont.

ecoli_df %>%   group_by(`Site name`) %>%
  summarise(Min = min(`E. coli value (orgs/100 mL)`, na.rm = TRUE),
            Q1 = quantile(`E. coli value (orgs/100 mL)`, probs = 0.25, na.rm =TRUE),
            Mean = mean(`E. coli value (orgs/100 mL)`, na.rm=TRUE),
            Median = median(`E. coli value (orgs/100 mL)`, na.rm = TRUE),
            Q3 = quantile(`E. coli value (orgs/100 mL)`, probs = 0.75, na.rm = TRUE),
            Max = max(`E. coli value (orgs/100 mL)`, na.rm = TRUE),
            SD = sd (`E. coli value (orgs/100 mL)`, na.rm = TRUE),
            IQR = IQR(`E. coli value (orgs/100 mL)`, na.rm = TRUE),
            n = n(),
            Missing = sum(is.na(`E. coli value (orgs/100 mL)`))) -> table1

knitr::kable(table1)
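|Site name       | Min|    Q1|     Mean| Median|  Q3|   Max|        SD|   IQR|   n| Missing|
|:---------------|---:|-----:|--------:|------:|---:|-----:|---------:|-----:|---:|-------:|
|Kew             |  10|  84.5| 660.4580|    170| 345| 24000| 2359.8472| 260.5| 132|       1|
|Launching Place |  96| 220.0| 403.0465|    310| 420|  2600|  366.3122| 200.0| 140|      11|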

Transformation

These data will not undergo any transformation prior to hypothesis testing. Each group has a large sample size (131 and 129 non-missing observations at Kew and Launching Place respectively), so although the data in both groups are skewed to the right, the sampling distribution of each mean will be approximately normal. This concept is known as the Central Limit Theorem.
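The Central Limit Theorem argument can be illustrated with a small simulation. This is only a sketch: the log-normal "population", its parameters and the sample size of 130 are assumptions chosen to mimic right-skewed data, not values estimated from the Yarra Watch measurements.

```r
set.seed(1)  # arbitrary seed so the sketch is repeatable

# Hypothetical right-skewed "population": log-normal values, not the real data
draw_sample <- function(n) rlnorm(n, meanlog = 5, sdlog = 1)

# Means of 5,000 simulated samples of size 130 (close to the group sizes here)
sample_means <- replicate(5000, mean(draw_sample(130)))

# Simple moment-based skewness helper (0 for a symmetric distribution)
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

raw_skew  <- skewness(draw_sample(5000))
mean_skew <- skewness(sample_means)
cat("Skewness of raw values:  ", round(raw_skew, 2), "\n")
cat("Skewness of sample means:", round(mean_skew, 2), "\n")
```

The sample means should be far less skewed than the raw values. How close they get to a normal distribution depends on how heavy the right tail is, so the approximation is rougher for data with extreme outliers.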

kew$`E. coli value (orgs/100 mL)` %>% qqPlot(dist = "norm")

## [1] 69 31
launching_place$`E. coli value (orgs/100 mL)` %>% 
  qqPlot(dist ="norm")

## [1] 36 77

The Q-Q plots above are not strictly needed; however, they visualise the distribution of the data and compare it with what would be expected if the data followed a normal distribution. If the data were normally distributed, the points would fall within the blue shaded band. As we can see, this is not the case.

Hypothesis Testing

  1. Normality: the raw data are not normally distributed, but both samples are large enough that the sampling distribution of each mean will be approximately normal, as per the Central Limit Theorem.
  2. Homogeneity of variance: checked with Levene's test. By default, leveneTest centres each group on its median rather than its mean, which is more robust for highly skewed data such as these.
  3. The two populations being compared are independent of each other: yes, one cannot have the same E.coli bacteria in two places at once.
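To make assumption 2 more concrete, the median-centred Levene's test can be sketched by hand in base R: it amounts to a one-way ANOVA on each observation's absolute deviation from its own group median. The `toy` data frame below is simulated and purely hypothetical; it is not the E.coli data.

```r
set.seed(2)  # arbitrary seed for a repeatable illustration

# Hypothetical skewed measurements for two sites (not the Yarra Watch data)
toy <- data.frame(
  site  = factor(rep(c("A", "B"), each = 50)),
  value = c(rlnorm(50, meanlog = 5, sdlog = 1.0),
            rlnorm(50, meanlog = 5, sdlog = 0.5))
)

# Step 1: absolute deviation of each value from its own group's median
group_median <- ave(toy$value, toy$site, FUN = median)
toy$abs_dev  <- abs(toy$value - group_median)

# Step 2: one-way ANOVA on those deviations; its F test is the Levene statistic
levene_by_hand <- anova(lm(abs_dev ~ site, data = toy))
print(levene_by_hand)
```

A small p-value in this table would suggest unequal spread between the groups, in which case the equal-variance t-test would be harder to justify.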
#Homogeneity of variance
leveneTest(`E. coli value (orgs/100 mL)`~ `Site name`, data = ecoli_df)
#two sample t-test
t.test(`E. coli value (orgs/100 mL)`~ `Site name`, data = ecoli_df, var.equal=TRUE,
       alternative = "two.sided")
## 
##  Two Sample t-test
## 
## data:  E. coli value (orgs/100 mL) by Site name
## t = 1.2244, df = 258, p-value = 0.2219
## alternative hypothesis: true difference in means between group Kew and group Launching Place is not equal to 0
## 95 percent confidence interval:
##  -156.5725  671.3956
## sample estimates:
##             mean in group Kew mean in group Launching Place 
##                      660.4580                      403.0465

Hypothesis Testing Cont.

Interpreting the results:

LeveneTest

Two-sample t-test

Null hypothesis: That the difference between the two population means is 0: \[H_0: \mu_1 - \mu_2 = 0\]

Alternative hypothesis: That the difference between the two population means is not equal to 0: \[H_A: \mu_1 -\mu_2 \ne 0\]

A two-sample t-test was used to test for a significant difference in the mean E.coli levels at Kew and Launching Place. E.coli levels at both Kew and Launching Place exhibited non-normality, as seen via histograms, descriptive statistics and Q-Q plots. However, thanks to the large sample sizes in both groups, the Central Limit Theorem allowed the t-test to be applied. Homogeneity of variance could be assumed, as indicated through Levene's test. The results of the two-sample t-test assuming equal variance did not find a statistically significant difference in mean E.coli levels between Kew and Launching Place, t(df=258)=1.22, p=0.22, 95% CI for the difference in means [-156.6, 671.4]. The results of this investigation indicate there is no significant difference in mean E.coli levels between Kew and Launching Place.
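As a cross-check, the pooled-variance t statistic, p-value and confidence interval can be recomputed from the summary statistics in the descriptive table alone. This is a sketch of the standard equal-variance formula; the variable names are my own, and the group sizes are taken as n minus the number of missing values.

```r
# Summary statistics copied from the descriptive statistics table
mean_kew <- 660.4580; sd_kew <- 2359.8472; n_kew <- 132 - 1   # non-missing
mean_lp  <- 403.0465; sd_lp  <- 366.3122;  n_lp  <- 140 - 11  # non-missing

df <- n_kew + n_lp - 2

# Pooled standard deviation under the equal-variance assumption
sd_pooled <- sqrt(((n_kew - 1) * sd_kew^2 + (n_lp - 1) * sd_lp^2) / df)

# Standard error of the difference in means
se_diff <- sd_pooled * sqrt(1 / n_kew + 1 / n_lp)

diff_means <- mean_kew - mean_lp
t_stat  <- diff_means / se_diff
p_value <- 2 * pt(-abs(t_stat), df)                       # two-sided p-value
ci      <- diff_means + c(-1, 1) * qt(0.975, df) * se_diff  # 95% CI

cat("t =", round(t_stat, 4), " df =", df, " p =", round(p_value, 4), "\n")
cat("95% CI:", round(ci, 1), "\n")
```

The values printed should agree with the t.test output up to rounding of the summary statistics, and the confidence interval spans zero, which is consistent with the non-significant result.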

Discussion

Strengths

Limitations

Conclusions

References

Yarra watch report 2009-11. Vgls.vic.gov.au. 2022. [online] Available at: https://www.vgls.vic.gov.au/client/en_AU/search/asset/1159478/0 [Accessed 28 May 2022].

Feng, C., Wang, H., Lu, N., Chen, T., He, H., Lu, Y. and Tu, X. (2014). Log-transformation and its implications for data analysis. Shanghai Archives of Psychiatry, [online] 26(2), pp.105–109. doi:10.3969/j.issn.1002-0829.2014.02.009. Available at: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4120293/