Analysis of the experimental data

Getting the data

yeast <- read.csv("yeast_uv_irradiation_v2.csv", stringsAsFactors = T)

Running 1-way ANOVA and Tukey HSD on the data

Null hypothesis: no statistical difference between the means of each treatment.

yeast.aov <- aov(cell_count ~ treatment, data = yeast)
summary(yeast.aov)
##             Df Sum Sq Mean Sq F value Pr(>F)
## treatment    2  60091   30046   2.176  0.133
## Residuals   27 372808   13808
TukeyHSD(yeast.aov)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = cell_count ~ treatment, data = yeast)
## 
## $treatment
##                                  diff        lwr      upr     p adj
## irradiated_light-irradiated_dark -3.2 -133.49425 127.0942 0.9979579
## nonirradiated-irradiated_dark    93.3  -36.99425 223.5942 0.1968301
## nonirradiated-irradiated_light   96.5  -33.79425 226.7942 0.1770616

Since the p values are greater than 0.05, there is no statistical significance. Therefore, we retain the null hypothesis of no statistical difference between the means of each treatment.

Viewing the data

boxplot(cell_count ~ treatment, data = yeast, col = c("dodgerblue", "firebrick", "seagreen"))
stripchart(cell_count ~ treatment, data = yeast,
           add = T,
           vertical = T,
           pch = 20,
           method = "jitter",
           jitter = 0.1)

palette(c("dodgerblue", "firebrick", "seagreen"))
plot(cell_count ~ plate, data = yeast, col = treatment, pch = 20, cex = 1.5)
legend("topright", levels(yeast$treatment), pch = 20, col = 1:nlevels(yeast$treatment))

There is an outlier in the non-irradiated treatment - possibly due to experimental error.

By observing the comment “unusual shape of colonies” and the number of cells counted for the outlier, it suggests that either there was cross contamination of a foreign organism during plating or there was a dilution error causing the aggregation of colonies.

Given this comment and the extremity of the cell count, we will remove the outlier from the data set.

Removing the outlier

yeast.fix <- subset(yeast, cell_count < 300)
boxplot(cell_count ~ treatment, data = yeast.fix, col = c("dodgerblue", "firebrick", "seagreen"))
stripchart(cell_count ~ treatment, data = yeast.fix,
           add = T,
           vertical = T,
           pch = 20,
           method = "jitter",
           jitter = 0.1)

We can now observe the data set without outliers.

Based on the jitter of the plots, we can observe an uneven distribution of data within the irradiated_light treatment. By observing the comments of the irradiated_light treatment, we can further remove the 3 lowest yielding plates due to bacterial contamination and drying out.

Removing experimental errors

yeast.fix <- subset(yeast, cell_count < 300 & cell_count > 20)
boxplot(cell_count ~ treatment, data = yeast.fix, col = c("dodgerblue", "firebrick", "seagreen"))
stripchart(cell_count ~ treatment, data = yeast.fix,
           add = T,
           vertical = T,
           pch = 20,
           method = "jitter",
           jitter = 0.1)

We can now observe the data set without outliers and experimental errors.

Running 1-way ANOVA on the data without outliers and experimental errors

Null hypothesis: no statistical difference between the means of each treatment.

yeast.fix.aov <- aov(cell_count ~ treatment, data = yeast.fix)
summary(yeast.fix.aov)
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## treatment    2   4685  2342.5   10.56 0.000559 ***
## Residuals   23   5104   221.9                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Since the p value is lower than 0.05, there is statistical significance. Therefore, we reject the null hypothesis of no statistical difference between the means of each treatment.

However, we must test the remaining assumptions of ANOVA to see if our result is valid.

The two assumptions of ANOVA that remain to be tested are normal distribution and homogeneity of the variance.

Testing normal distribution (as presented by the residuals)

Null hypothesis: the data is normally distributed.

If the residuals are normally distributed, the distribution within the categories are also normal as residuals represent the remaining variation within each category.

shapiro.test(yeast.fix.aov$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  yeast.fix.aov$residuals
## W = 0.97251, p-value = 0.6891

Since the p value is greater than 0.05, there is no statistical significance. Therefore, we retain the null hypothesis of normal distribution of the data.

Testing homoscedasticity

To test the homoscedasticity, we must use the Bartlett’s test as there is more than two groups.

Null hypothesis: there is homoscedasticity within the data.

bartlett.test(cell_count ~ treatment, data = yeast.fix)
## 
##  Bartlett test of homogeneity of variances
## 
## data:  cell_count by treatment
## Bartlett's K-squared = 11.241, df = 2, p-value = 0.003623

Since the p value is lower than 0.05, there is statistical significance. Therefore, we reject the null hypothesis of homoscedasticity within the data.

However, homoscedasticity is an assumption of ANOVA.

To better test homoscedasticity, we will log transform the data to decouple variance and mean.

Log transforming the cell count

yeast.fix$cell_count_log <- log(yeast.fix$cell_count)

Running 1-way ANOVA on log-transformed data

Null hypothesis: no statistical difference between the means of each treatment.

yeast.fix.aov_log <- aov(cell_count_log ~ treatment, data = yeast.fix)
summary(yeast.fix.aov_log)
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## treatment    2 0.6606  0.3303   12.18 0.000247 ***
## Residuals   23 0.6239  0.0271                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Since the p value is lower than 0.05, there is statistical significance. Therefore, we reject the null hypothesis of no statistical difference between the means of each treatment.

Testing normal distribution (as presented by the residuals)

Null hypothesis: the data is normally distributed.

shapiro.test(yeast.fix.aov_log$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  yeast.fix.aov_log$residuals
## W = 0.98111, p-value = 0.8966

Since the p value is greater than 0.05, there is no statistical significance. Therefore, we retain the null hypothesis of normal distribution of the data.

Testing homoscedasticity

Null hypothesis: there is homoscedasticity within the data.

bartlett.test(cell_count_log ~ treatment, data = yeast.fix)
## 
##  Bartlett test of homogeneity of variances
## 
## data:  cell_count_log by treatment
## Bartlett's K-squared = 7.8579, df = 2, p-value = 0.01966

Since the p value is still lower than 0.05, there is still statistical significance. Therefore, we still reject the null hypothesis of homoscedasticity within the data.

Now, we must re-view our data.

Re-viewing the data

boxplot(cell_count ~ treatment, data = yeast.fix, col = c("dodgerblue", "firebrick", "seagreen"))
stripchart(cell_count ~ treatment, data = yeast.fix,
           add = T,
           vertical = T,
           pch = 20,
           method = "jitter",
           jitter = 0.1)

By re-viewing the data, we can observe large variation within the irradiated_light treatment that may be contributing to the lack of homoscedasticity. There is a lower and a higher cluster. By observing the comments, the difference between these two clusters is the weather and amount of sun exposure: cloudy vs. full sun.

To analyse the full potential of photoreactivation, we will remove the plate counts conditioned to cloudy weather.

Removing cloudy plate counts

yeast.fix <- subset(yeast, cell_count < 300 & cell_count > 20 & comment != "cloudy")
boxplot(cell_count ~ treatment, data = yeast.fix, col = c("dodgerblue", "firebrick", "seagreen"))
stripchart(cell_count ~ treatment, data = yeast.fix,
           add = T,
           vertical = T,
           pch = 20,
           method = "jitter",
           jitter = 0.1)

Running 1-way ANOVA on the data (without cloudy conditioned plates)

Null hypothesis: no statistical difference between the means of each treatment.

yeast.fix.aov <- aov(cell_count ~ treatment, data = yeast.fix)
summary(yeast.fix.aov)
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## treatment    2   7023    3511   36.89 2.87e-07 ***
## Residuals   19   1809      95                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Since the p value is lower than 0.05, there is statistical significance. Therefore, we reject the null hypothesis of no statistical difference between the means of each treatment.

Testing normal distribution (as presented by the residuals)

Null hypothesis: the data is normally distributed

shapiro.test(yeast.fix.aov$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  yeast.fix.aov$residuals
## W = 0.95527, p-value = 0.3999

Since the p value is greater than 0.05, there is no statistical significance. Therefore, we retain the null hypothesis of normal distribution of the data.

Testing homoscedasticity

Null hypothesis: there is homoscedasticity within the data.

bartlett.test(cell_count ~ treatment, data = yeast.fix)
## 
##  Bartlett test of homogeneity of variances
## 
## data:  cell_count by treatment
## Bartlett's K-squared = 3.9988, df = 2, p-value = 0.1354

Since the p value is greater than 0.05, there is no statistical significance. Therefore, we retain the null hypothesis of homoscedasticity within the data.

Log transforming the cell count (without cloudy conditioned plates)

yeast.fix$cell_count_log <- log(yeast.fix$cell_count)

Running 1-way ANOVA on log-transformed data (without cloudy conditioned plates)

Null hypothesis: no statistical difference between the means of each treatment.

yeast.fix.aov_log <- aov(cell_count_log ~ treatment, data = yeast.fix)
summary(yeast.fix.aov_log)
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## treatment    2 0.9389  0.4694   40.55 1.39e-07 ***
## Residuals   19 0.2200  0.0116                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Since the p value is lower than 0.05, there is statistical significance. Therefore, we reject the null hypothesis of no statistical difference between the means of each treatment.

Testing normal distribution (as presented by the residuals)

Null hypothesis: the data is normally distributed.

shapiro.test(yeast.fix.aov_log$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  yeast.fix.aov_log$residuals
## W = 0.96135, p-value = 0.5169

Since the p value is greater than 0.05, there is no statistical significance. Therefore, we retain the null hypothesis of normal distribution of the data.

Testing homoscedasticity

Null hypothesis: there is homoscedasticity within the data.

bartlett.test(cell_count_log ~ treatment, data = yeast.fix)
## 
##  Bartlett test of homogeneity of variances
## 
## data:  cell_count_log by treatment
## Bartlett's K-squared = 1.511, df = 2, p-value = 0.4698

Since the p value is greater than 0.05, there is no statistical significance. Therefore, we retain the null hypothesis of homoscedasticity within the data.

Now that we have validated ANOVA, we can Tukey HSD to see the differences between each treatment.

Running Tukey HSD on log-transformed data (without cloudy conditioned plates)

TukeyHSD(yeast.fix.aov_log)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = cell_count_log ~ treatment, data = yeast.fix)
## 
## $treatment
##                                        diff        lwr       upr     p adj
## irradiated_light-irradiated_dark  0.5165093  0.3365710 0.6964476 0.0000019
## nonirradiated-irradiated_dark     0.3657272  0.2401335 0.4913210 0.0000015
## nonirradiated-irradiated_light   -0.1507821 -0.3330126 0.0314485 0.1160792

Since some p values are lower than 0.05, there is statistically significant difference between certain treatments.

There is statistical difference between irradiated_light and irradiated_dark.

There is statistical difference between nonirradiated and irradiated_dark.

There is no statistical difference between nonirradiated and irradiated_light.

Combining boxplots for better comparison

par(mfrow = c(2, 1))
par(mar = c(5, 5, 2, 5))
yeast.box1 <- subset(yeast, cell_count < 300 & cell_count > 20)
yeast.box2 <- subset(yeast, cell_count < 300 & cell_count > 20 & comment != "cloudy")
boxplot(cell_count ~ treatment, data = yeast.box1, 
        col = c("dodgerblue", "firebrick", "seagreen"),
        main = "Boxplot without outliers and experimental errors", 
        cex.main = 0.8)
stripchart(cell_count ~ treatment, data = yeast.box1,
           add = T,
           vertical = T,
           pch = 20,
           method = "jitter",
           jitter = 0.1)
boxplot(cell_count ~ treatment, data = yeast.box2, 
        col = c("dodgerblue", "firebrick", "seagreen"),
        main = "Boxplot without outliers, experimental errors, and cloudy conditions",
        cex.main = 0.8)
stripchart(cell_count ~ treatment, data = yeast.box2,
           add = T,
           vertical = T,
           pch = 20,
           method = "jitter",
           jitter = 0.1)

Mean and std. dev. plot (EXTRA stats)

yeast.fix.agr <- merge(aggregate(cell_count ~ treatment, data = yeast.fix, mean),
                       aggregate(cell_count ~ treatment, data = yeast.fix, sd),
                       by = "treatment",
                       suffixes = c(".mean", ".sd"))
yeast.fix.agr
##          treatment cell_count.mean cell_count.sd
## 1  irradiated_dark        68.60000      6.363088
## 2 irradiated_light       114.66667      7.023769
## 3    nonirradiated        99.22222     12.968980
library(ggplot2)
ggplot(yeast.fix.agr, aes(x = treatment, y = cell_count.mean)) +
  geom_errorbar(aes(ymin = cell_count.mean - cell_count.sd, 
                    ymax = cell_count.mean + cell_count.sd,
                    width = 0.1)) +
  geom_point(size = 3, col = "firebrick") +
  geom_violin(data = yeast.fix, mapping = aes(x = treatment, y = cell_count),
              fill = NA,
              colour = "firebrick") +
  theme_bw()