yeast <- read.csv("yeast_uv_irradiation_v2.csv", stringsAsFactors = T)
Null hypothesis: no statistical difference between the means of each treatment.
yeast.aov <- aov(cell_count ~ treatment, data = yeast)
summary(yeast.aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## treatment 2 60091 30046 2.176 0.133
## Residuals 27 372808 13808
TukeyHSD(yeast.aov)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = cell_count ~ treatment, data = yeast)
##
## $treatment
## diff lwr upr p adj
## irradiated_light-irradiated_dark -3.2 -133.49425 127.0942 0.9979579
## nonirradiated-irradiated_dark 93.3 -36.99425 223.5942 0.1968301
## nonirradiated-irradiated_light 96.5 -33.79425 226.7942 0.1770616
Since the p values are greater than 0.05, there is no statistical significance. Therefore, we retain the null hypothesis of no statistical difference between the means of each treatment.
boxplot(cell_count ~ treatment, data = yeast, col = c("dodgerblue", "firebrick", "seagreen"))
stripchart(cell_count ~ treatment, data = yeast,
add = T,
vertical = T,
pch = 20,
method = "jitter",
jitter = 0.1)
palette(c("dodgerblue", "firebrick", "seagreen"))
plot(cell_count ~ plate, data = yeast, col = treatment, pch = 20, cex = 1.5)
legend("topright", levels(yeast$treatment), pch = 20, col = 1:nlevels(yeast$treatment))
There is an outlier in the non-irradiated treatment - possibly due to experimental error.
By observing the comment “unusual shape of colonies” and the number of cells counted for the outlier, it suggests that either there was cross contamination of a foreign organism during plating or there was a dilution error causing the aggregation of colonies.
Given this comment and the extremity of the cell count, we will remove the outlier from the data set.
yeast.fix <- subset(yeast, cell_count < 300)
boxplot(cell_count ~ treatment, data = yeast.fix, col = c("dodgerblue", "firebrick", "seagreen"))
stripchart(cell_count ~ treatment, data = yeast.fix,
add = T,
vertical = T,
pch = 20,
method = "jitter",
jitter = 0.1)
We can now observe the data set without outliers.
Based on the jitter of the plots, we can observe an uneven distribution of data within the irradiated_light treatment. By observing the comments of the irradiated_light treatment, we can further remove the 3 lowest yielding plates due to bacterial contamination and drying out.
yeast.fix <- subset(yeast, cell_count < 300 & cell_count > 20)
boxplot(cell_count ~ treatment, data = yeast.fix, col = c("dodgerblue", "firebrick", "seagreen"))
stripchart(cell_count ~ treatment, data = yeast.fix,
add = T,
vertical = T,
pch = 20,
method = "jitter",
jitter = 0.1)
We can now observe the data set without outliers and experimental errors.
Null hypothesis: no statistical difference between the means of each treatment.
yeast.fix.aov <- aov(cell_count ~ treatment, data = yeast.fix)
summary(yeast.fix.aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## treatment 2 4685 2342.5 10.56 0.000559 ***
## Residuals 23 5104 221.9
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Since the p value is lower than 0.05, there is statistical significance. Therefore, we reject the null hypothesis of no statistical difference between the means of each treatment.
However, we must test the remaining assumptions of ANOVA to see if our result is valid.
The two assumptions of ANOVA that remain to be tested are normal distribution and homogeneity of the variance.
Null hypothesis: the data is normally distributed.
If the residuals are normally distributed, the distribution within the categories are also normal as residuals represent the remaining variation within each category.
shapiro.test(yeast.fix.aov$residuals)
##
## Shapiro-Wilk normality test
##
## data: yeast.fix.aov$residuals
## W = 0.97251, p-value = 0.6891
Since the p value is greater than 0.05, there is no statistical significance. Therefore, we retain the null hypothesis of normal distribution of the data.
To test the homoscedasticity, we must use the Bartlett’s test as there is more than two groups.
Null hypothesis: there is homoscedasticity within the data.
bartlett.test(cell_count ~ treatment, data = yeast.fix)
##
## Bartlett test of homogeneity of variances
##
## data: cell_count by treatment
## Bartlett's K-squared = 11.241, df = 2, p-value = 0.003623
Since the p value is lower than 0.05, there is statistical significance. Therefore, we reject the null hypothesis of homoscedasticity within the data.
However, homoscedasticity is an assumption of ANOVA.
To better test homoscedasticity, we will log transform the data to decouple variance and mean.
yeast.fix$cell_count_log <- log(yeast.fix$cell_count)
Null hypothesis: no statistical difference between the means of each treatment.
yeast.fix.aov_log <- aov(cell_count_log ~ treatment, data = yeast.fix)
summary(yeast.fix.aov_log)
## Df Sum Sq Mean Sq F value Pr(>F)
## treatment 2 0.6606 0.3303 12.18 0.000247 ***
## Residuals 23 0.6239 0.0271
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Since the p value is lower than 0.05, there is statistical significance. Therefore, we reject the null hypothesis of no statistical difference between the means of each treatment.
Null hypothesis: the data is normally distributed.
shapiro.test(yeast.fix.aov_log$residuals)
##
## Shapiro-Wilk normality test
##
## data: yeast.fix.aov_log$residuals
## W = 0.98111, p-value = 0.8966
Since the p value is greater than 0.05, there is no statistical significance. Therefore, we retain the null hypothesis of normal distribution of the data.
Null hypothesis: there is homoscedasticity within the data.
bartlett.test(cell_count_log ~ treatment, data = yeast.fix)
##
## Bartlett test of homogeneity of variances
##
## data: cell_count_log by treatment
## Bartlett's K-squared = 7.8579, df = 2, p-value = 0.01966
Since the p value is still lower than 0.05, there is still statistical significance. Therefore, we still reject the null hypothesis of homoscedasticity within the data.
Now, we must re-view our data.
boxplot(cell_count ~ treatment, data = yeast.fix, col = c("dodgerblue", "firebrick", "seagreen"))
stripchart(cell_count ~ treatment, data = yeast.fix,
add = T,
vertical = T,
pch = 20,
method = "jitter",
jitter = 0.1)
By re-viewing the data, we can observe large variation within the irradiated_light treatment that may be contributing to the lack of homoscedasticity. There is a lower and a higher cluster. By observing the comments, the difference between these two clusters is the weather and amount of sun exposure: cloudy vs. full sun.
To analyse the full potential of photoreactivation, we will remove the plate counts conditioned to cloudy weather.
yeast.fix <- subset(yeast, cell_count < 300 & cell_count > 20 & comment != "cloudy")
boxplot(cell_count ~ treatment, data = yeast.fix, col = c("dodgerblue", "firebrick", "seagreen"))
stripchart(cell_count ~ treatment, data = yeast.fix,
add = T,
vertical = T,
pch = 20,
method = "jitter",
jitter = 0.1)
Null hypothesis: no statistical difference between the means of each treatment.
yeast.fix.aov <- aov(cell_count ~ treatment, data = yeast.fix)
summary(yeast.fix.aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## treatment 2 7023 3511 36.89 2.87e-07 ***
## Residuals 19 1809 95
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Since the p value is lower than 0.05, there is statistical significance. Therefore, we reject the null hypothesis of no statistical difference between the means of each treatment.
Null hypothesis: the data is normally distributed
shapiro.test(yeast.fix.aov$residuals)
##
## Shapiro-Wilk normality test
##
## data: yeast.fix.aov$residuals
## W = 0.95527, p-value = 0.3999
Since the p value is greater than 0.05, there is no statistical significance. Therefore, we retain the null hypothesis of normal distribution of the data.
Null hypothesis: there is homoscedasticity within the data.
bartlett.test(cell_count ~ treatment, data = yeast.fix)
##
## Bartlett test of homogeneity of variances
##
## data: cell_count by treatment
## Bartlett's K-squared = 3.9988, df = 2, p-value = 0.1354
Since the p value is greater than 0.05, there is no statistical significance. Therefore, we retain the null hypothesis of homoscedasticity within the data.
yeast.fix$cell_count_log <- log(yeast.fix$cell_count)
Null hypothesis: no statistical difference between the means of each treatment.
yeast.fix.aov_log <- aov(cell_count_log ~ treatment, data = yeast.fix)
summary(yeast.fix.aov_log)
## Df Sum Sq Mean Sq F value Pr(>F)
## treatment 2 0.9389 0.4694 40.55 1.39e-07 ***
## Residuals 19 0.2200 0.0116
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Since the p value is lower than 0.05, there is statistical significance. Therefore, we reject the null hypothesis of no statistical difference between the means of each treatment.
Null hypothesis: the data is normally distributed.
shapiro.test(yeast.fix.aov_log$residuals)
##
## Shapiro-Wilk normality test
##
## data: yeast.fix.aov_log$residuals
## W = 0.96135, p-value = 0.5169
Since the p value is greater than 0.05, there is no statistical significance. Therefore, we retain the null hypothesis of normal distribution of the data.
Null hypothesis: there is homoscedasticity within the data.
bartlett.test(cell_count_log ~ treatment, data = yeast.fix)
##
## Bartlett test of homogeneity of variances
##
## data: cell_count_log by treatment
## Bartlett's K-squared = 1.511, df = 2, p-value = 0.4698
Since the p value is greater than 0.05, there is no statistical significance. Therefore, we retain the null hypothesis of homoscedasticity within the data.
Now that we have validated ANOVA, we can Tukey HSD to see the differences between each treatment.
TukeyHSD(yeast.fix.aov_log)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = cell_count_log ~ treatment, data = yeast.fix)
##
## $treatment
## diff lwr upr p adj
## irradiated_light-irradiated_dark 0.5165093 0.3365710 0.6964476 0.0000019
## nonirradiated-irradiated_dark 0.3657272 0.2401335 0.4913210 0.0000015
## nonirradiated-irradiated_light -0.1507821 -0.3330126 0.0314485 0.1160792
Since some p values are lower than 0.05, there is statistically significant difference between certain treatments.
There is statistical difference between irradiated_light and irradiated_dark.
There is statistical difference between nonirradiated and irradiated_dark.
There is no statistical difference between nonirradiated and irradiated_light.
par(mfrow = c(2, 1))
par(mar = c(5, 5, 2, 5))
yeast.box1 <- subset(yeast, cell_count < 300 & cell_count > 20)
yeast.box2 <- subset(yeast, cell_count < 300 & cell_count > 20 & comment != "cloudy")
boxplot(cell_count ~ treatment, data = yeast.box1,
col = c("dodgerblue", "firebrick", "seagreen"),
main = "Boxplot without outliers and experimental errors",
cex.main = 0.8)
stripchart(cell_count ~ treatment, data = yeast.box1,
add = T,
vertical = T,
pch = 20,
method = "jitter",
jitter = 0.1)
boxplot(cell_count ~ treatment, data = yeast.box2,
col = c("dodgerblue", "firebrick", "seagreen"),
main = "Boxplot without outliers, experimental errors, and cloudy conditions",
cex.main = 0.8)
stripchart(cell_count ~ treatment, data = yeast.box2,
add = T,
vertical = T,
pch = 20,
method = "jitter",
jitter = 0.1)
yeast.fix.agr <- merge(aggregate(cell_count ~ treatment, data = yeast.fix, mean),
aggregate(cell_count ~ treatment, data = yeast.fix, sd),
by = "treatment",
suffixes = c(".mean", ".sd"))
yeast.fix.agr
## treatment cell_count.mean cell_count.sd
## 1 irradiated_dark 68.60000 6.363088
## 2 irradiated_light 114.66667 7.023769
## 3 nonirradiated 99.22222 12.968980
library(ggplot2)
ggplot(yeast.fix.agr, aes(x = treatment, y = cell_count.mean)) +
geom_errorbar(aes(ymin = cell_count.mean - cell_count.sd,
ymax = cell_count.mean + cell_count.sd,
width = 0.1)) +
geom_point(size = 3, col = "firebrick") +
geom_violin(data = yeast.fix, mapping = aes(x = treatment, y = cell_count),
fill = NA,
colour = "firebrick") +
theme_bw()