The main benefit of studying community assembly, and factors that
impact it, is to further understand the connection of community
diversity to ecosystem functionality, which is crucial in upkeeping
ecosystems. With spatial distance shown to be a significant factor in
community assembly processes, it is sensible to explore the waterbodies
of the Lunz study site, which consists of various environments with
different conditions, and includes three springs, seven creeks, four
groundwater wells, a pond, a hyporheic spring, and three lakes. The
connectivity is presumed to be hindered at the middle point of the
catchment area, the Middle Lake.
Package loading
library(FEAST) # for source tracking
## Warning: replacing previous import 'dplyr::combine' by 'gridExtra::combine'
## when loading 'FEAST'
library(tidyverse)
library(vegan)
library(FSA) # for dunn post-hoc test
library(caret)
library(broom)
library(patchwork) # for combining plots into 1
library(ggfortify)
library(pairwiseAdonis) # for permanova post-hoc
From every sample, DNA was extracted and used for 16S rRNA gene
amplicon sequencing. Amplicon sequencing and bioinformatic processing of
the raw sequence data for the inference of amplicon sequence variants
(ASV) using DADA2 was done, as well as other processing of organelles
and mitochondria.
Load data
# load metadata
metadata <- read.csv("~/faks/Ecology and Ecosystems/MEC-8 (Specific research project)/data/metadata_Lunz_revised.csv",
row.names = 1)
# load asv table
asv_table <- read.csv("~/faks/Ecology and Ecosystems/MEC-8 (Specific research project)/data/asv_table.csv", row.names=1)
# identifier need to be identical between asv_table and metadata
identical(rownames(metadata), colnames(asv_table))
# if there is some that are different use setdiff(rownames(metadata), colnames(asv_table)) to see which
# data on sources/sinks
feast_metadata <- read.csv("~/faks/Ecology and Ecosystems/MEC-8 (Specific research project)/data/feast_metadata.csv", row.names=1)
# identifiers for sources have to be unique - source 1 needs to be Source1_1, Source1_2, ... )
# SampleID column need to be rownames
FEAST
FEAST - the fast expectation-maximization was used, as a more
computational efficient and shorter alternative to SourceTracker.
Prepare the data for FEAST.
# classes of datasets
# feast_metadata should be a data.frame
class(feast_metadata)
#asv should be matrix and each column an ineger in order to perform feast
feast_asv <- as.matrix(asv_table) # matrix as a whole
for(i in 1:ncol(feast_asv)){ # each column as integer
feast_asv[, i] <- as.integer(feast_asv[, i])
}
# whole table should be a matrix
class(feast_asv)
# individual column should be integer
class(feast_asv[, 1])
# final step - form needs to be as with vegan (sites rows, species columns)
feast_asv <- t(feast_asv)
feast_asv[1:10, 1:5]
#to remove from environment
rm(i)
Perform FEAST. It exports the results and later on needs to be
imported again.
The metadata table has three columns (i.e., ‘Env’, ‘SourceSink’,
‘id’). The first column is a description of the sampled environment
(e.g., sample names), the second column indicates if this sample is a
source or a sink (can take the value ‘Source’ or ‘Sink’). The third
column is the Sink-Source id.
When using multiple sinks, only the rows with ‘SourceSink’ = Sink
need an id (between 1 - number of sinks in the data), the sources’ ids
are blank.
feast_output <- FEAST(C = feast_asv, # transposed asv_table
metadata = feast_metadata, # sink or source information
different_sources_flag = 0, # 0 if sources are not assigned to different sinks
EM_iterations = 1000, # number of iterations
dir_path = "C:/Users/Angela Cukusic/Documents/faks/Ecology and Ecosystems/MEC-8 (Specific research project)/Results",
outfile = "feast_results") # prefix for saving output file
# COVERAGE should be included in the command if we want to indicate the rarefaction depth
#we got an output, import it now
feast_table <- read.table("C:/Users/Angela Cukusic/Documents/faks/Ecology and Ecosystems/MEC-8 (Specific research project)/Results/feast_results_source_contributions_matrix.txt",
sep = "\t", row.names = 1)
Raw contribution to
plottable data
To get an overview of the total contribution of each source
(i.e. groups of samples) to the different sink we have to clean it.
# feast_output is of class "list"
# we want to make it a matrix
# 1. make a new vector with source types (and it is "character")
sources <- paste("source", c(1:8), sep = "")
# 2. make empty matrix with nrow = number of SINKS (5),
# and ncol = number of possible sources (20+ samples are designated to 1-8 sources)
source_contributions <- matrix(NA, # empty
nrow = nrow(feast_table), # 5 sinks
ncol = length(sources)) # n sources
# 3. fill matrix
for(i in 1:length(sources)){ # for each source add contribution
source_contributions[, i] <- rowSums(feast_table[, # sum up contributions for all 5 rows
grepl(sources[i], colnames(feast_table))])
# of only the columns which have sourceX in name
}
# 4. we added contributions assigned to each of 8 sources, now add from unknown sources
source_contributions <- cbind(source_contributions, feast_table$Unknown)
# 5. check if all contributuons for individual sink amount to 1
rowSums(source_contributions)
# 6. convert to data frame and bring back rownames and colnames
source_contributions <- as.data.frame(source_contributions)
colnames(source_contributions) <- c(sources, "unknown") # col
source_contributions <- source_contributions %>% # row
mutate(sample_id = rownames(feast_table), .before = 1)
# final clean up of environment
rm( sources, i)
Plot source
contribution
source_contributions %>%
pivot_longer(cols = -sample_id, names_to = "sources", values_to = "contributions") %>%
mutate(sources = as.factor(sources)) %>%
# find relative abundances for each date in each water kind (well/pump/surf)
group_by(sample_id) %>%
mutate(rel_cont = contributions / sum(contributions) * 100) %>%
ungroup() %>%
# plotting
ggplot(aes(x = sample_id, y = rel_cont, fill = sources)) +
geom_bar(stat = "identity") +
# labelling
geom_text(aes(label =
# label if % is big enough
ifelse(rel_cont > 2,
# if yes then paste, if no then empty ""
paste0(round(rel_cont, 4) , "%"), "" ) ),
# how to position it
position = position_stack(vjust = 0.5), size = 4) +
# theme
theme(legend.position = "none",
axis.ticks.y = element_blank(),
axis.text.y = element_blank(),
axis.title.y = element_blank()) +
theme_classic() +
scale_fill_brewer(palette = "Set1") +
labs(x= "Sinks", y="Relative contributions %", fill = "Sources") +
scale_x_discrete(labels = c("Untersee \n(31m)",
"Untersee \n(5m)",
"Untersee - \nnear OSB inflow",
"Untersee - \nat OSB \ninflow",
"Sediment: \nUntersee - \nat OSB \ninflow"))

From five sink samples identified in this study, four were water
samples, and one a sediment sample. Each sink had a substantial portion
of microbes of unknown origin.
Additional to that, another visible trend is water samples being
traced back to Source 1 (surface waters), with non-existent or minimal
portion of Source 2 related microbes (surface waters’ sediment and
biofilm samples, see Fig.2.).
The sediment sink, on the other hand, showed no influence of Source 1
on its composition, but had a high Source 2 origin. Water sinks
contained more ASVs not found in any of the sources than the sediment
sink.
feast_table %>%
rownames_to_column("sample_id") %>%
# find the two sinks in question
filter(sample_id %in% c("OSB_in_LUS_inflow_LUS_at_OSB_inflow",
"OSB_in_LUS_sed_LUS_sediment")) %>%
# find the individual samples from source 6
select( "sample_id", contains("source6")) %>%
# remove the last part of names, which says "source6"
rename_all(~str_remove(., ".{10}$")) %>%
pivot_longer(cols = -sample_id, names_to = "sources", values_to = "contributions") %>%
mutate(sources = as.factor(sources)) %>%
# find relative abundances for each date in each water kind (well/pump/surf)
group_by(sample_id) %>%
mutate(rel_cont = contributions / sum(contributions) * 100) %>%
ungroup() %>%
# plotting
ggplot(aes(x = sample_id, y = rel_cont, fill = sources)) +
geom_bar(stat = "identity", color = "black") +
# labelling
geom_text(aes(label =
# label if % is big enough
ifelse(rel_cont > 2,
# if yes then paste, if no then empty ""
paste0(round(rel_cont, 4) , "%"), "" ) ),
# how to position it
position = position_stack(vjust = 0.5), size = 4) +
# theme
theme(legend.position = "none",
axis.ticks.y = element_blank(),
axis.text.y = element_blank(),
axis.title.y = element_blank()) +
theme_classic() +
labs(x= "Sinks", y="Relative contributions %", fill = "Sources") +
scale_x_discrete(labels = c("Untersee - \nat OSB \ninflow",
"Sediment: \nUntersee - \nat OSB \ninflow")) +
scale_fill_manual(values = c( "#006400", "#FFD700", "lightblue", "#FFD700",
"lightblue", "#006400", "lightblue" ))

Even though the two sinks at the inflow both had ~10% of Source 6
microbes (which was two locations - both sediment and water), it seems
that more detailed look gives more insight that sample type is a better
predictor of contribution than location is.
Result: water sink was more influenced by water samples communities,
while sediment sink contained almost exclusively microbes from spring
sediments and spring biofilms.
Sample type
distribution
metadata_type <- metadata %>%
rownames_to_column("sources") %>%
select("sources", "type_III") %>%
mutate(type_III = case_when(
# assign categories
type_III %in% c("Creek", "Hyporheic",
"Pond", "Lake", "Inflow") ~ "Surface",
TRUE ~ type_III))
feast_table %>%
rownames_to_column("sink_id") %>%
# rename the columns so that they can be connected to the metadata (type info)
rename_all(~str_remove(., ".{10}$")) %>%
pivot_longer(cols = - sink_id, names_to = "sources", values_to = "contributions") %>%
# filter out contributions from unknow sources
filter(sources != "Unknown") %>%
# join with metadata
left_join(., metadata_type, by = "sources") %>%
filter(type_III != "NA") %>%
filter( contributions < 0.05) %>%
# plot
ggplot(aes(x= type_III, y= contributions)) +
geom_boxplot(aes(fill = sink_id) ) +
theme_linedraw() +
scale_fill_manual(labels=c("Untersee (31m)",
"Untersee (5m)",
"Untersee (near inflow)",
"Untersee (at inflow)",
"Sediment (at inflow)"),
values=c("#acc7d9", "#8bb0ca","#5180a2","#3e647d","#d8b365")) +
theme(legend.position=c(0.3,0.8),
legend.title = element_blank(),
legend.text = element_text(size = 14),
legend.background = element_rect(),
legend.box.background = element_rect(color = "gray")) +
labs(y = "Contributions", x= "Sample type") +
scale_x_discrete(labels = c("Biofilm", "Ground water", "Sediment",
"Soil", "Spring water", "Surface water") )

Surface waters contributed to all water sinks, but their effect on
the sediment of the sink was minimal. Samples that did contribute to
sediment sink are biofilm, sediment and soil source samples. Additional
to contributing to the sediment sink, biofilm samples also had an impact
on the community at the inflow of the Lower Lake. Springwater samples
were the biggest contributor to the inflow, and had an impact on the
microbial assembly of the Lower Lake at its 31-meter depth.
Spatial
distribution
metadata_rank <- metadata %>%
rownames_to_column("sources") %>%
select("sources", "rank") %>%
filter(rank != "sink")
feast_table %>%
rownames_to_column("sink_id") %>%
# rename the columns so that they can be connected to the metadata (rank info)
rename_all(~str_remove(., ".{10}$")) %>%
pivot_longer(cols = - sink_id, names_to = "sources", values_to = "contributions") %>%
# filter out contributions from unknow sources
filter(sources != "Unknown") %>%
# join with metadata
left_join(., metadata_rank, by = "sources") %>%
filter(rank != "NA") %>%
# plot
ggplot(aes(x= rank, y= contributions)) +
geom_boxplot( fill = "violet") +
facet_wrap(~sink_id, scales = "free_y", labeller =
as_labeller( c("LUS_dp_31m_LUS_31m" = "Untersee (31m)",
"LUS_dp_5m_LUS_5m" = "Untersee (5m)",
"LUS_in_OSB_LUS_near_OSB_inflow" = "Untersee (near inflow)",
"OSB_in_LUS_inflow_LUS_at_OSB_inflow" = "Untersee (at inflow)",
"OSB_in_LUS_sed_LUS_sediment" = "Sediment (at inflow)" ))) +
theme_linedraw() +
labs(y = "Contributions", x= "Ranked by spatial distance ")

We hypothesized that a substantial effect of passive dispersal of
microorganisms on the composition of microbial communities in the lake
sinks should result in significantly declining contributions of sources
with increasing spatial distance to the main lake (i.e. distance ranks
since data on exact spatial distances were not available).
Tackling this hypothesis, we did not find a significant correlation
between spatial distance and sink contribution. None of the sinks
followed the presumed trend of rank 1 having the highest sink
contribution and rank 8 the lowest, nor was there a difference between
the ranks before the connectivity barrier and after the barrier. The
rank 6 is upstream of the Middle lake, which is the presumed
barrier.
Alpha diversity
shannondiv <- vegan::diversity(t(asv_table))
# recalculating it as the effective number of species to be more comparable across studies.
ens <- exp(shannondiv) %>% as.data.frame() %>% rownames_to_column("sample_id") %>%
rename("diversity" = ".")
diversity_data <- specnumber(t(asv_table)) %>% as.data.frame() %>% rownames_to_column("sample_id") %>%
rename("richness" = ".") %>% inner_join(., ens, by ="sample_id" )
# we will plot diversity by source so make clean categorization here
feast_metadata %>%
# remove last two strings
mutate(Env = str_remove(Env, ".{2}$")) %>%
# where SourceSink column is a source keep the value from the Env column
mutate(Source_category = case_when(SourceSink == "Source" ~ Env,
# otherwise keep the value from SourceSink column
TRUE ~ SourceSink) ) %>%
# some trouble with source 3
mutate(Source_category = case_when(Source_category %in% c("source3", "source3_") ~ "source3",
TRUE ~Source_category)) %>%
rownames_to_column("sample_id") %>%
select("sample_id", "Source_category") %>%
mutate(Source_category = as.factor(Source_category)) %>%
# we now have clean category and can connect it to the diversity data
inner_join(., diversity_data, by = "sample_id") %>%
select(-sample_id) %>%
#pivot longer so we can facet
pivot_longer(cols = -Source_category, names_to = "measure", values_to = "values") %>%
# plot
ggplot( aes(x = Source_category, y = values)) +
#stat_boxplot(geom ='errorbar') +
geom_boxplot(fill = "violet") +
theme_linedraw()+
coord_flip() +
facet_wrap(~measure, scales = "free", labeller =
as_labeller( c("diversity" = "Shannon's diversity index",
"richness" = "Richness") ) )+
theme(legend.position = "none",
axis.title=element_text(size=14,face="bold"),
axis.text.y = element_text(size=10,face="bold"),
strip.text = element_text(size = 14)) +
labs( x = "Source category",
y = NULL) +
scale_x_discrete(labels = c("Sink", "Source 1", "Source 2", "Source 3", "Source 4",
"Source 5", "Source 6", "Source 7", "Source 8"))

Richness of the sinks was the lowest compared to source richness.
Sources 7, and 8, which have a direct inflow to the sinks, had the
highest richness. Source 1, which makes a big contribution to sink
communities had higher richness than the sinks it contributed to.
Sink evenness was lower than evenness of sources. Species
distribution of the sink leans towards a few very dominant ASVs and many
rare ones, indicating possible harder environmental conditions or bigger
competitiveness in sink. Source 3, with its soil samples, had high
evenness. Source 4 (groundwater) had slightly lower evenness and
richness in comparison to other sources.
Test it:
# 1. the same data manipulation as above
feast_metadata %>%
mutate(Env = str_remove(Env, ".{2}$")) %>%
mutate(Source_category = case_when(SourceSink == "Source" ~ Env,
TRUE ~ SourceSink) ) %>%
mutate(Source_category = case_when(Source_category %in% c("source3", "source3_") ~ "source3",
TRUE ~Source_category)) %>%
rownames_to_column("sample_id") %>%
select("sample_id", "Source_category") %>%
mutate(Source_category = as.factor(Source_category)) %>%
inner_join(., diversity_data, by = "sample_id") %>%
select(-sample_id) -> data_for_alpha
# 2. perform ANOVA/ kruskal-wallis
# 1. assumption of ANOVA: Normal distribution:
# Each population to be compared should be normally distributed
# (NOT THE ENTIRE DATASET).
shapiro.test ( data_for_alpha[which(data_for_alpha$Source_category == "source1"), "diversity"] ) # NOT ND
shapiro.test ( data_for_alpha[which(data_for_alpha$Source_category == "source2"), "diversity"] ) # ND
shapiro.test ( data_for_alpha[which(data_for_alpha$Source_category == "source3"), "diversity"] ) # ND
shapiro.test ( data_for_alpha[which(data_for_alpha$Source_category == "source4"), "diversity"] ) # NOT ND
shapiro.test ( data_for_alpha[which(data_for_alpha$Source_category == "source5"), "diversity"] ) # NOT ND
shapiro.test ( data_for_alpha[which(data_for_alpha$Source_category == "source6"), "diversity"] ) # ND
# source 7 is too small sample size to perform the test
shapiro.test ( data_for_alpha[which(data_for_alpha$Source_category == "source8"), "diversity"] ) # ND
shapiro.test ( data_for_alpha[which(data_for_alpha$Source_category == "Sink"), "diversity"] ) # ND
# A: would be better to do a non-parametric test
## 2. assumption of ANOVA: Homogeneity of variance:
## Variance in the populations compared should be the same/similar.
bartlett.test ( diversity ~ Source_category, data= data_for_alpha )
# assumption met
# 3. test
kruskal.test (diversity ~ Source_category, data= data_for_alpha)
# 4. post-hoc
FSA::dunnTest(diversity ~ Source_category, data= data_for_alpha)
## Dunn (1964) Kruskal-Wallis multiple comparison
## p-values adjusted with the Holm method.
Even though one assumption is met (hom.var.) we should not use ANOVA
- the assumption of homogeneity of variance is an extension of the
assumption of normality. It would not be feasible to compare a skewed
distribution in one group to a normal distribution in another group,
even if they have comparable variances!
Kruskal-Wallis test confirms differences in shannon’s index. We do
post-hoc diagnosing to see which groups are different with a Dunn test.
The difference which is statistically sound is the diversity levels
between the source 3 samples and the Sink, after adjusting the
p-value.
Since source 3 consists of soil samples, it is not unexpected that it
would have higher diversity.
Beta diversity
bc_diss <- vegdist(t(asv_table), method = "bray")
nmds <- metaMDS(bc_diss)
# make palette for plotting
pal = c("green4", "navy", "orange", "brown", "blue", "lightblue")
# lets plot an nmds where the color is sample type, shape is source sink
as.data.frame(nmds$points) %>%
cbind(data_for_alpha$Source_category, metadata_type$type_III) %>%
rename("source_cat" = "data_for_alpha$Source_category",
"type" = "metadata_type$type_III") %>%
mutate(source_cat = case_when(source_cat == "Sink" ~ "sink",
TRUE ~ "source")) -> data_for_nmds
# plot
data_for_nmds %>%
ggplot(aes(x = MDS1, y = MDS2)) +
geom_point(aes(fill = type, color = type,
shape= source_cat),
size = 5, stroke=1, alpha = 0.6) +
# customizing the plot
scale_fill_manual(values = pal) +
scale_color_manual(values = pal) +
scale_shape_manual(values = c(21,23)) +
theme_linedraw() +
annotate(geom="text",x=-1.4, y=1.5, label="stress =", color="black") +
annotate(geom="text", x=-1.1, y=1.5, label=round(nmds$stress,4),
color="black") +
#stat_ellipse(aes(colour = type), linewidth = 0.3) +
#stat_ellipse(aes(fill = type), geom="polygon",level=0.95,alpha=0.04) +
theme(legend.background = element_blank(),
legend.box.background = element_rect(colour = "gray")) +
labs(color = "Water \ntype", shape = "Source") +
guides(fill = FALSE) +
geom_vline(xintercept = c(0), color = "grey70", linetype = 2) +
geom_hline(yintercept = c(0), color = "grey70", linetype = 2) -> plot_nmds
There was clear grouping of soil and sediment communities. Surface
water, spring water and biofilm communities were more spread out, and
more dissimilar to soil than to sediment communities.
Out of five sinks, four water and one sediment sink, the most similar
to sediment source samples, in terms of community assembly, was the
sediment sink. It is also closer to the soil sources, whereas water
sinks communities showed little similarity to soil sources.

Environmental data
pca (environmenta
variables of water communities)
df_pca <- metadata %>%
# we only have measurements for water, not for soil nor sediment
filter(type_I == "Water") %>%
select(where(is.numeric)) %>%
select(-c("filter", # dont need it
"N_NO2", "P_PO4")) # NAs
# remove the highly correlated ones, like internalATP and totalATP
cor_matrix <- cor(df_pca)
highly_correlated <- caret::findCorrelation(cor_matrix, cutoff = 0.9)
df_pca <- df_pca[, -highly_correlated]
# now there is 22 observations and 10 environmental variables
# lets see if any env vars can be packed neatly in PCs
pca_env <- prcomp(df_pca, scale = TRUE)
biplot(pca_env)

# i believe it is beneficial to remove the sample PX_gw since it is a single outlier in almost every variable, and stresses the ordination
pca_env <- prcomp(df_pca[rownames(df_pca) != "PX_gw",], scale = TRUE)
# we need the column for coloring by type, get it from the original metadata
metadata[rownames(df_pca) != "PX_gw", c(colnames(df_pca), "type_III", "type_I")] %>%
# remove the outlier sample, the columns that are highly correlated and soil/sed samples
filter(type_I == "Water") -> data_for_pca
autoplot(pca_env, data =data_for_pca ,
colour = 'type_III',
size=5, loadings=TRUE,
loadings.colour="black",
loadings.label=TRUE, loadings.label.size=5,
loadings.label.colour="black" )+
geom_vline(xintercept = c(0), color = "grey70", linetype = 2) +
geom_hline(yintercept = c(0), color = "grey70", linetype = 2) +
geom_point(shape= 21, color = "black", size = 5) +
theme_linedraw() +
geom_vline(xintercept = c(0), color = "grey70", linetype = 2) +
geom_hline(yintercept = c(0), color = "grey70", linetype = 2) +
scale_color_brewer(palette = "Accent")+
theme(legend.background = element_blank(),
legend.box.background = element_rect(colour = "gray")) -> pca_plot
pca_plot

The first three components have 1+ variance explanation. See more
detailed what PC1, PC2 and PC3 (as combinations of environmental
variables) consist of.
normal data: variance = (st.dev)^2 pca data: eigenvalue = (st.dev)^2
“eigenvalues represent the variance explained by each principal
component” -> eigen ~ variance
# Extract the loadings from the PCA result and convert to tidy format
loadings_df <- tidy(pca_env, matrix = "rotation")
# PC1
# Create the ggplot2 object for plotting the loadings
loadings_df %>% filter(PC == "1") %>%
#x=reorder(class,-amount,sum)
ggplot( aes(x = reorder(column, -value), y = value, group = factor(column))) +
geom_bar(stat = "identity", position = "dodge",
fill= "gray90", color = "gray20") +
geom_text(aes(label = column), position = position_dodge(width = 0.9),
#vjust = "inward",
angle = 90 , hjust = "inward") +
labs(x = "Variables of PC1", y = "Loadings") +
theme_minimal() +
theme(axis.text.x = element_blank(),
legend.position = "none") -> plot_pca1
# PC2
loadings_df %>% filter(PC == "2") %>%
#x=reorder(class,-amount,sum)
ggplot( aes(x = reorder(column, -value), y = value, group = factor(column))) +
geom_bar(stat = "identity", position = "dodge",
fill= "gray90", color = "gray20") +
geom_text(aes(label = column), position = position_dodge(width = 0.9),
#vjust = "inward",
angle = 90 , hjust = "inward") +
labs(x = "Variables of PC2", y = "Loadings") +
theme_minimal() +
theme(axis.text.x = element_blank(),
legend.position = "none") -> plot_pca2
# PC3
loadings_df %>% filter(PC == "3") %>%
#x=reorder(class,-amount,sum)
ggplot( aes(x = reorder(column, -value), y = value, group = factor(column))) +
geom_bar(stat = "identity", position = "dodge",
fill= "gray90", color = "gray20") +
geom_text(aes(label = column), position = position_dodge(width = 0.9),
#vjust = "inward",
angle = 90 , hjust = "inward") +
labs(x = "Variables of PC3", y = "Loadings") +
theme_minimal() +
theme(axis.text.x = element_blank(),
legend.position = "none") -> plot_pca3
plot_pca1 + plot_pca2 + plot_pca3

Test pca dispersions.
# we can check the significance in difference between groups in both pca and nmds
euclidean_dist <- data_for_pca %>% select(-c("type_I" , "type_III")) %>%
scale() %>% as.data.frame() %>%
dist(., method = "euclidean")
disp_pca <- betadisper( euclidean_dist, data_for_pca$type_III )
anova(disp_pca) #no sign. differences
## Analysis of Variance Table
##
## Response: Distances
## Df Sum Sq Mean Sq F value Pr(>F)
## Groups 6 22.11 3.685 1.7759 0.1762
## Residuals 14 29.05 2.075
disp_pca$distances %>% as.data.frame() %>%
rename("Distance" = ".") %>%
rownames_to_column("sample_id") %>%
cbind(data_for_pca) %>%
# plot
ggplot( aes(x = type_III, y = Distance)) +
geom_boxplot( fill = "violet") +
labs(x = NULL, y = "Environmental differences \nstandardized Euclidean distance") +
theme_linedraw() +
theme(axis.text.x = element_text(size = 10),
axis.title.y = element_text(size = 10)) -> pca_disp
nmds (water
communities)
asv_water <- asv_table [,rownames(df_pca)] %>% select(!"PX_gw")
bc_diss_water <- vegdist(t(asv_water), method = "bray")
nmds_water <- metaMDS(bc_diss_water)
as.data.frame(nmds_water$points) %>%
cbind(data_for_pca$type_III) %>%
rename("type" = "data_for_pca$type_III") %>%
# plot
ggplot(aes(x = MDS1, y = MDS2)) +
geom_point(aes(fill = type), shape = 21,
size = 5, color= "black" ) +
# customizing the plot
scale_fill_brewer(palette = "Accent") +
theme_linedraw() +
annotate(geom="text",x=-1.4, y=1.5, label="stress =", color="black") +
annotate(geom="text", x=-1.1, y=1.5, label=round(nmds_water$stress,4),
color="black") +
theme(legend.background = element_blank(),
legend.box.background = element_rect(colour = "gray")) +
labs(color = "Water \ntype", shape = "Source") +
guides(fill = FALSE) +
geom_vline(xintercept = c(0), color = "grey70", linetype = 2) +
geom_hline(yintercept = c(0), color = "grey70", linetype = 2) -> nmds_plot_water
euclidean_dist <- data_for_pca %>% select(-c("type_I" , "type_III")) %>%
scale() %>% as.data.frame() %>%
dist(., method = "euclidean")
disp_nmds <- betadisper( bc_diss_water, data_for_pca$type_III )
anova(disp_nmds) # sign. differences
disp_nmds$distances %>% as.data.frame() %>%
rename("Distance" = ".") %>%
rownames_to_column("sample_id") %>%
cbind(data_for_pca) %>%
# plot
ggplot( aes(x = type_III, y = Distance)) +
geom_boxplot( fill = "violet") +
labs(x = NULL, y = "Bray-curtis distance") +
theme_linedraw() +
theme(axis.text.x = element_text(size = 10),
axis.title.y = element_text(size = 10)) -> nmds_water_disp
nmds_plot_water + pca_plot +
nmds_water_disp + pca_disp

## Analysis of Variance Table
##
## Response: Distances
## Df Sum Sq Mean Sq F value Pr(>F)
## Groups 6 0.48932 0.081554 3.0834 0.03869 *
## Residuals 14 0.37029 0.026449
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Lake communities and groundwater communties seem to have relatively
distinct microbial composition. The hyporheic, pond and inflow water
consist only of one sample point, when we remove the soil and sediment
samples for this analysis, so it is not possible to draw conclusions.
But the rest of the samples have a similar relation for both the
environmental variaton and the community variation - environmental
homogeneity of the groundwater samples in visible in the low dispersion
of the groundwater microbial communities. (check significance and put on
plot later)
Constrained
ordiantion
CAP ordination as an distance-based alternative to the RDA (which
uses bray, as opposed to RDA’s euclidean). Hellinger transformation
(square root of relative abundances) is used to account for
compositionality of data.
data_for_pca %>% select(where(is.numeric)) %>%
scale() %>% as.data.frame() -> env.data
decostand(t(asv_water), method = "hellinger") -> sp.dat
# check
identical(rownames(env.data), colnames(asv_water)) #TRUE
# perform model
simpleRDA <- capscale(sp.dat ~ ., data=env.data ,
distance = "bray")
vif.cca(simpleRDA) # all variables lower than 10
# test
# test env paameters
anova.model3 <- anova(simpleRDA, step=1000, by = "term")
anova.model3
# sign: bix* , fi***, TCC***, Cond, O2**, Temp*
# new model
simpleRDA <- capscale(sp.dat ~ bix + fi + TCC + Cond + O2 + Temp, data=env.data ,
distance = "bray")
# test env paameters
anova.model3 <- anova(simpleRDA, step=1000, by = "term")
# test of all canonical axes
anova.model <- anova.cca(simpleRDA, by='axis', step=1000) # CAP1,CAP2,CAP3 significant
# test the whole model
anova.model2 <- anova.cca(simpleRDA, step=1000) # sign
RsquareAdj(simpleRDA)$adj.r.squared # explains 30%
## [1] TRUE
## bix fi TCC Cond O2 Temp pH tot.N
## 5.638966 3.408220 2.283686 2.525874 1.294785 1.745977 1.627837 4.961876
## N_NO3 tot.P
## 6.261860 5.577410
## Permutation test for capscale under reduced model
## Terms added sequentially (first to last)
## Permutation: free
## Number of permutations: 999
##
## Model: capscale(formula = sp.dat ~ bix + fi + TCC + Cond + O2 + Temp + pH + tot.N + N_NO3 + tot.P, data = env.data, distance = "bray")
## Df SumOfSqs F Pr(>F)
## bix 1 0.31946 1.7225 0.001 ***
## fi 1 0.84028 4.5307 0.001 ***
## TCC 1 0.66653 3.5939 0.001 ***
## Cond 1 0.31856 1.7177 0.039 *
## O2 1 0.50456 2.7205 0.001 ***
## Temp 1 0.43066 2.3221 0.013 *
## pH 1 0.30100 1.6230 0.075 .
## tot.N 1 0.29959 1.6154 0.064 .
## N_NO3 1 0.30315 1.6345 0.053 .
## tot.P 1 0.24783 1.3363 0.133
## Residual 10 1.85462
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## [1] 0.2943814
We can say that bix and fix, as well as the temeprature, oxygen and
conductivity have an impact on the microbial communities of Luny water
bodies. Additionally, the number of TCC (microbial load) in the water
body is also a significant indicator of the water community
composition.
Plot the constrained ordination if we want to but it could be a bit
redundadnt considering the individual ordinations of environmental
variables and communitiies. On the other hand, the visual depiction
implies need for more sample points, considering how sparse the
ordination is.
# vectors
ccavectors <- as.matrix(scores(simpleRDA, display = "bp", scaling = "sites")*3.32) %>%
t() %>% as.data.frame() %>%
rename("bix*" = "bix",
"Temp*" = "Temp",
"O2**" = "O2",
"fi***" = "fi",
"TCC***" = "TCC") %>%
t() %>% as.data.frame()
#bix* , fi***, TCC***, Cond, O2**, Temp*
# site coordinates
site_data <- scores(simpleRDA, display = "sites") %>%
as.data.frame() %>%
cbind(., data_for_pca$type_III) %>%
rename("type" = "data_for_pca$type_III")
# plotting
plot_cca <-
site_data %>%
ggplot( aes(x = CAP1, y = CAP2)) +
geom_point(aes(fill = type), shape = 21,
size = 5, color= "black" )+
geom_segment(data = ccavectors, aes(x = 0, y = 0, xend = CAP1, yend = CAP2),
size = 1.2, arrow = arrow(length = unit(0.5, "cm"))) +
geom_vline(xintercept = c(0), color = "grey70", linetype = 2) +
geom_hline(yintercept = c(0), color = "grey70", linetype = 2) +
geom_text(data = ccavectors, aes(x = CAP1*1.2, y = CAP2*1.1,
label = rownames(ccavectors)),
size=6 ) +
theme_linedraw() +
scale_fill_brewer(palette = "Accent") +
theme(legend.background = element_blank(),
legend.box.background = element_rect(colour = "gray"))
plot_cca

There is certain indication that microbial community assembly of Lunz
lakes water samples and near water bodies is not impacted by the
environmental variables which were measured in this study. However, the
sample type provided a clearer impact on community assembly. Samples of
type sediment, soil and surface water showed community similarity within
the type category. To determine whether sample type is merely a proxy
for ecological conditions, environmental variables for samples other
than water samples need to be measured, which was not the case in this
study.
Contribution of sources to sinks was not impacted by spatial
distance, and the artificial lake provided no physical barrier.
Contribution by sample type provided more accurate insight than
contribution by “Source” category, which was an arbitrary category
determined as a mix of sample type and distance from the sink.
This was shown with Source 6 contribution, consisting of spring
sediment and water. Sample type analysis showed that spring sediment of
Source 6 contributed to sink sediments, and Source 6 water samples
contributed to water sinks.
In summary, this study provides a starting point in community
assembly research of microbial fauna of Lunz lakes catchment, and gives
indication that future studies should be focused on environmental
conditions of each microenvironment, and their ties to microbial
assembly, since categorising samples by distance or similarity does not
give clear results.
---
title: "Microbial community assembly analysis of Lunz lakes and the surrounding waterbodies" 
author:  "Angela Cukusic"
date: July 29, 2023
output:
  html_document:
    toc: true               # Table of contents
    toc_float: true         # Keep the table of contents floating on the page
    code_download: true     # Enable code download button
    code_copy: true         # Enable code copy button
    number_sections: true   # Numbering the sections
    theme: cerulean           # Choose a Bootstrap theme (check available themes)
    highlight: tango        # Syntax highlighting style (pick a style from available themes)

---

The main benefit of studying community assembly, and factors that impact it, is to further understand the connection of community diversity to ecosystem functionality, which is crucial in upkeeping ecosystems. With spatial distance shown to be a significant factor in community assembly processes, it is sensible to explore the waterbodies of the Lunz study site, which consists of various
environments with different conditions, and includes three springs, seven creeks, four groundwater wells, a pond, a hyporheic spring, and three lakes. The connectivity is presumed to be hindered at the middle point of the catchment area, the Middle Lake.

# Package loading
```{r, results='hide', message=FALSE}
library(FEAST) # for source tracking
library(tidyverse)
library(vegan)
library(FSA) # for dunn post-hoc test
library(caret) 
library(broom)
library(patchwork) # for combining plots into 1
library(ggfortify)
library(pairwiseAdonis) # for permanova post-hoc
```

From every sample, DNA was extracted and used for 16S rRNA gene amplicon sequencing. Amplicon sequencing and bioinformatic processing of the raw sequence data for the inference of amplicon sequence variants (ASV) using DADA2 was done, as well as other processing of organelles and mitochondria.

# Load data
```{r, warning=FALSE, results='hide'}


# load metadata
metadata <- read.csv("~/faks/Ecology and Ecosystems/MEC-8 (Specific research project)/data/metadata_Lunz_revised.csv", 
                     row.names = 1)

# load asv table
asv_table <- read.csv("~/faks/Ecology and Ecosystems/MEC-8 (Specific research project)/data/asv_table.csv", row.names=1)

# identifier need to be identical between asv_table and metadata
identical(rownames(metadata), colnames(asv_table))
# if there is some that are different use setdiff(rownames(metadata), colnames(asv_table)) to see which

# data on sources/sinks
feast_metadata <- read.csv("~/faks/Ecology and Ecosystems/MEC-8 (Specific research project)/data/feast_metadata.csv", row.names=1)
# identifiers for sources have to be unique - source 1 needs to be Source1_1, Source1_2, ... )
# SampleID column need to be  rownames

```

# FEAST

FEAST - the fast expectation-maximization was used, as a more computational efficient and shorter alternative to SourceTracker.


Prepare the data for FEAST.
```{r, eval=FALSE}
# classes of datasets

# feast_metadata should be a data.frame
class(feast_metadata)

#asv should be matrix and each column an ineger in order to perform feast
feast_asv <- as.matrix(asv_table)                # matrix as a whole
    for(i in 1:ncol(feast_asv)){                 # each column as integer
                    feast_asv[, i] <- as.integer(feast_asv[, i])
}


# whole  table should be a matrix
class(feast_asv)

# individual column should be integer
class(feast_asv[, 1])



# final step - form needs to be as with vegan (sites rows, species columns)
feast_asv <- t(feast_asv)
feast_asv[1:10, 1:5]


#to remove from environment
rm(i) 


```


Perform FEAST. It exports the results and later on needs to be imported again.

The metadata table has three columns (i.e., 'Env', 'SourceSink', 'id'). 
The first column is a description of the sampled environment (e.g., sample names), 
the second column indicates if this sample is a source or a sink (can take the value 'Source' or 'Sink'). 
The third column is the Sink-Source id. 

When using multiple sinks, only the rows with 'SourceSink' = Sink need an id 
(between 1 - number of sinks in the data), the sources’ ids are blank. 



```{r, eval=FALSE}
feast_output <- FEAST(C = feast_asv,              # transposed asv_table
                      metadata = feast_metadata,  # sink or source information
                      different_sources_flag = 0, # 0 if sources are not assigned to different sinks
                      EM_iterations = 1000,       # number of iterations
                      dir_path = "C:/Users/Angela Cukusic/Documents/faks/Ecology and Ecosystems/MEC-8 (Specific research project)/Results", 
                      outfile = "feast_results")  # prefix for saving output file

# COVERAGE should be included in the command if we want to indicate the rarefaction depth


```

```{r}
#we got an output, import it now
feast_table <- read.table("C:/Users/Angela Cukusic/Documents/faks/Ecology and Ecosystems/MEC-8 (Specific research project)/Results/feast_results_source_contributions_matrix.txt", 
                          sep = "\t", row.names = 1)

```



## Raw contribution to plottable data
To get an overview of the total contribution of each source (i.e. groups of samples) to the different sink we have to clean it.

```{r, results='hide'}
# feast_output is of class "list"
# we want to make it a matrix

# 1. make a new vector with source types (and it is "character")
sources <- paste("source", c(1:8), sep = "")


# 2. make empty matrix with nrow = number of SINKS (5), 
# and ncol = number of possible sources (20+ samples are designated to 1-8 sources)
source_contributions <- matrix(NA,                        # empty
                               nrow = nrow(feast_table),  # 5 sinks
                               ncol = length(sources))    # n sources

# 3. fill matrix 
for(i in 1:length(sources)){    # for each source add contribution 
  source_contributions[, i] <- rowSums(feast_table[,      # sum up contributions for all 5 rows
                                      grepl(sources[i], colnames(feast_table))]) 
                                                          # of only the columns which have sourceX in name
}

# 4. we added contributions assigned to each of 8 sources, now add from unknown sources
source_contributions <- cbind(source_contributions, feast_table$Unknown)


# 5. check if all contributuons for individual sink amount to 1
rowSums(source_contributions) 


# 6. convert to data frame and bring back rownames and colnames
source_contributions <- as.data.frame(source_contributions)
colnames(source_contributions) <- c(sources, "unknown") # col
source_contributions <- source_contributions %>%        # row
  mutate(sample_id = rownames(feast_table), .before = 1) 


#  final clean up of environment
rm( sources, i)

```

## Plot source contribution

```{r}
source_contributions %>% 

  pivot_longer(cols = -sample_id, names_to = "sources", values_to = "contributions") %>% 
  mutate(sources = as.factor(sources))     %>% 
  
  
# find  relative abundances for each date in each water kind (well/pump/surf)
  group_by(sample_id) %>%
  mutate(rel_cont = contributions / sum(contributions) * 100) %>%
  ungroup() %>%
  
# plotting
  ggplot(aes(x = sample_id, y = rel_cont, fill = sources)) +
  geom_bar(stat = "identity") +
  
# labelling   
  geom_text(aes(label =  
                  # label if % is big enough
                  ifelse(rel_cont > 2, 
                  # if yes then  paste, if no then empty ""
                  paste0(round(rel_cont, 4) , "%"), "" )  ),
            # how to position it
            position = position_stack(vjust = 0.5), size = 4) +

# theme  
  theme(legend.position = "none",
        axis.ticks.y = element_blank(),
        axis.text.y = element_blank(),
        axis.title.y = element_blank()) +
  theme_classic() + 
  scale_fill_brewer(palette = "Set1") +
  labs(x= "Sinks", y="Relative contributions %", fill = "Sources") +
  scale_x_discrete(labels = c("Untersee \n(31m)", 
                              "Untersee \n(5m)", 
                              "Untersee - \nnear OSB inflow", 
                              "Untersee - \nat OSB \ninflow", 
                              "Sediment: \nUntersee - \nat OSB \ninflow")) 


```

From five sink samples identified in this study, four were water samples, and one a sediment sample. Each sink had a substantial portion of microbes of unknown origin.

Additional to that, another visible trend is water samples being traced back to Source 1 (surface waters), with non-existent or minimal portion of Source 2 related microbes (surface waters' sediment and biofilm samples, see Fig.2.). 

The sediment sink, on the other hand, showed no influence of Source 1 on its composition, but had a high Source 2 origin.
Water sinks contained more ASVs not found in any of the sources than the sediment sink. 



```{r, warning=FALSE}
feast_table %>% 
  rownames_to_column("sample_id") %>% 
# find the two sinks in question  
  filter(sample_id %in% c("OSB_in_LUS_inflow_LUS_at_OSB_inflow",
                          "OSB_in_LUS_sed_LUS_sediment")) %>% 
  
# find the individual samples from source 6
  select( "sample_id",  contains("source6")) %>% 
# remove the last part of names, which says "source6"  
  rename_all(~str_remove(., ".{10}$")) %>% 
  
  pivot_longer(cols = -sample_id, names_to = "sources", values_to = "contributions") %>% 
  mutate(sources = as.factor(sources))     %>% 
  
  
# find  relative abundances for each date in each water kind (well/pump/surf)
  group_by(sample_id) %>%
  mutate(rel_cont = contributions / sum(contributions) * 100) %>%
  ungroup() %>%
  
# plotting
  
  ggplot(aes(x = sample_id, y = rel_cont, fill = sources)) +
  geom_bar(stat = "identity", color = "black") +
  
# labelling   
  geom_text(aes(label =  
                  # label if % is big enough
                  ifelse(rel_cont > 2, 
                  # if yes then  paste, if no then empty ""
                  paste0(round(rel_cont, 4) , "%"), "" )  ),
            # how to position it
            position = position_stack(vjust = 0.5), size = 4) +

# theme  
  theme(legend.position = "none",
        axis.ticks.y = element_blank(),
        axis.text.y = element_blank(),
        axis.title.y = element_blank()) +
  theme_classic() + 
  labs(x= "Sinks", y="Relative contributions %", fill = "Sources") +
  scale_x_discrete(labels = c("Untersee - \nat OSB \ninflow", 
                              "Sediment: \nUntersee - \nat OSB \ninflow")) +
  scale_fill_manual(values = c( "#006400", "#FFD700", "lightblue", "#FFD700",
                                "lightblue", "#006400", "lightblue" ))
```


Even though the two sinks at the inflow both had ~10% of Source 6 microbes (which was two locations - both sediment and water), it seems that more detailed look gives more insight that sample type is a better predictor of contribution than location is.

Result: water sink was more influenced by water samples communities, while sediment sink contained almost exclusively microbes from spring sediments and spring biofilms.



# Sample type distribution
```{r, warning=FALSE, results='hold'}

metadata_type <- metadata %>% 
                 rownames_to_column("sources") %>% 
                 select("sources", "type_III") %>% 
                 mutate(type_III = case_when(
                 # assign categories  
                   type_III %in% c("Creek", "Hyporheic",
                                   "Pond", "Lake", "Inflow") ~ "Surface", 
                            TRUE ~ type_III)) 




feast_table %>% 
  rownames_to_column("sink_id") %>% 
# rename the columns so that they can be connected to the metadata (type info)  
  rename_all(~str_remove(., ".{10}$")) %>% 
  pivot_longer(cols = - sink_id, names_to = "sources", values_to = "contributions") %>% 
# filter out contributions from unknow sources
  filter(sources != "Unknown") %>% 
# join with metadata
  left_join(., metadata_type, by = "sources") %>% 
  filter(type_III != "NA") %>% 
  filter( contributions < 0.05) %>% 
# plot
  ggplot(aes(x= type_III, y= contributions)) + 
  geom_boxplot(aes(fill = sink_id) ) +
  theme_linedraw() +
  scale_fill_manual(labels=c("Untersee (31m)", 
                             "Untersee (5m)",
                             "Untersee (near inflow)", 
                             "Untersee (at inflow)", 
                             "Sediment (at inflow)"), 
                    values=c("#acc7d9", "#8bb0ca","#5180a2","#3e647d","#d8b365")) +
  theme(legend.position=c(0.3,0.8), 
        legend.title = element_blank(), 
        legend.text = element_text(size = 14), 
        legend.background = element_rect(), 
        legend.box.background  =   element_rect(color = "gray")) +
  labs(y = "Contributions", x= "Sample type") + 
  scale_x_discrete(labels = c("Biofilm", "Ground water", "Sediment", 
                              "Soil", "Spring water", "Surface water") )
  
  

```


Surface waters contributed to all water sinks, but their effect on the sediment of the sink was minimal. 
Samples that did contribute to sediment sink are  biofilm, sediment and soil source samples. 
Additional to contributing to the sediment sink, biofilm samples also had an impact on the community at the inflow of the Lower Lake.
Springwater samples were the biggest contributor to the inflow, and had an impact on the
microbial assembly of the Lower Lake at its 31-meter depth.



# Spatial distribution

```{r, warning=FALSE, results='hold'}
metadata_rank <- metadata %>% 
                 rownames_to_column("sources") %>% 
                 select("sources", "rank") %>% 
                 filter(rank != "sink")

feast_table %>% 
  rownames_to_column("sink_id") %>% 
# rename the columns so that they can be connected to the metadata (rank info)  
  rename_all(~str_remove(., ".{10}$")) %>% 
  pivot_longer(cols = - sink_id, names_to = "sources", values_to = "contributions") %>% 
# filter out contributions from unknow sources
  filter(sources != "Unknown") %>% 
# join with metadata
  left_join(., metadata_rank, by = "sources") %>% 
  filter(rank != "NA") %>% 
# plot
  ggplot(aes(x= rank, y= contributions)) + 
  geom_boxplot( fill = "violet") +
  facet_wrap(~sink_id, scales = "free_y",  labeller = 
               as_labeller( c("LUS_dp_31m_LUS_31m" = "Untersee (31m)",
                              "LUS_dp_5m_LUS_5m" = "Untersee (5m)",
                              "LUS_in_OSB_LUS_near_OSB_inflow" = "Untersee (near inflow)",
                              "OSB_in_LUS_inflow_LUS_at_OSB_inflow" = "Untersee (at inflow)",
                              "OSB_in_LUS_sed_LUS_sediment" = "Sediment (at inflow)" ))) +
  theme_linedraw() + 
  labs(y = "Contributions", x= "Ranked by spatial distance ")

  
```

We hypothesized that a substantial effect of passive dispersal of microorganisms on the composition of microbial communities in the lake sinks should result in significantly declining contributions of sources with increasing spatial distance to the main lake (i.e. distance ranks since data on exact spatial distances were not available). 

Tackling this hypothesis, we did not find a significant correlation between spatial distance and sink contribution. None of the sinks followed the presumed trend of rank 1 having the highest sink contribution and rank 8 the lowest, nor was there a difference between the ranks before the connectivity barrier and after the barrier. 
The rank 6 is upstream of the Middle lake, which is the presumed barrier. 



# Pre-processing for community analysis

We need to account for the uneven sequencing depth.
```{r}
# number of reads per sample
asv_table %>% colSums() %>% summary() 

```

Considering all samples have the same number of reads the dataset has clearly already been rarefied by Lucas before giving it to me.




# Alpha diversity
```{r}

shannondiv <- vegan::diversity(t(asv_table))
# recalculating it as the effective number of species to be more comparable across studies.
ens <- exp(shannondiv) %>% as.data.frame() %>% rownames_to_column("sample_id") %>% 
  rename("diversity" = ".")

diversity_data <- specnumber(t(asv_table)) %>% as.data.frame() %>% rownames_to_column("sample_id") %>% 
  rename("richness" = ".") %>% inner_join(., ens, by ="sample_id" )

# we will plot diversity by source so make clean categorization here
feast_metadata %>% 
  # remove last two strings
  mutate(Env = str_remove(Env, ".{2}$")) %>% 
  # where SourceSink column is a source keep the value from the Env column
  mutate(Source_category = case_when(SourceSink == "Source" ~ Env,
  # otherwise keep the value from SourceSink column                                   
                                     TRUE ~ SourceSink) ) %>% 
  # some trouble with source 3
  mutate(Source_category = case_when(Source_category %in% c("source3", "source3_") ~ "source3",
                                     TRUE ~Source_category)) %>% 
  
  rownames_to_column("sample_id")  %>% 
  select("sample_id", "Source_category") %>% 
  mutate(Source_category = as.factor(Source_category)) %>% 
  # we now have clean category and can connect it to the diversity data
  inner_join(., diversity_data,   by = "sample_id") %>% 
  select(-sample_id) %>% 
  
  #pivot longer so we can facet
  pivot_longer(cols = -Source_category, names_to = "measure", values_to = "values") %>% 
  
  # plot
  ggplot( aes(x = Source_category, y = values)) +
  #stat_boxplot(geom ='errorbar') + 
  geom_boxplot(fill = "violet") +
  theme_linedraw()+
  coord_flip() + 
  facet_wrap(~measure, scales = "free",  labeller = 
               as_labeller( c("diversity" = "Shannon's diversity index", 
                              "richness" = "Richness") ) )+
  theme(legend.position = "none",
        axis.title=element_text(size=14,face="bold"),
        axis.text.y = element_text(size=10,face="bold"),
        strip.text = element_text(size = 14)) +
  labs( x = "Source category",
       y = NULL) +
  scale_x_discrete(labels = c("Sink",  "Source 1", "Source 2", "Source 3", "Source 4", 
                               "Source 5", "Source 6",  "Source 7",  "Source 8"))  



```

Richness of the sinks was the lowest compared to source richness. Sources 7, and 8,
which have a direct inflow to the sinks, had the highest richness. Source 1, which makes a big
contribution to sink communities had higher richness than the sinks it
contributed to.

Sink evenness was lower than evenness of sources. Species distribution of the sink leans towards a
few very dominant ASVs and many rare ones, indicating possible harder environmental conditions or bigger
competitiveness in sink. Source 3, with its soil samples, had high evenness. Source 4 (groundwater) had slightly lower
evenness and richness in comparison to other sources.

Test it:
```{r, results='hide'}
# 1. the same data manipulation as above
feast_metadata %>% 
  mutate(Env = str_remove(Env, ".{2}$")) %>% 
  mutate(Source_category = case_when(SourceSink == "Source" ~ Env,                           
                                     TRUE ~ SourceSink) ) %>% 
  mutate(Source_category = case_when(Source_category %in% c("source3", "source3_") ~ "source3",
                                     TRUE ~Source_category)) %>% 
  
  rownames_to_column("sample_id")  %>% 
  select("sample_id", "Source_category") %>% 
  mutate(Source_category = as.factor(Source_category)) %>% 
  inner_join(., diversity_data,   by = "sample_id") %>% 
  select(-sample_id) -> data_for_alpha


# 2. perform ANOVA/ kruskal-wallis

# 1. assumption of ANOVA: Normal distribution: 
# Each population to be compared should be normally distributed 
# (NOT THE ENTIRE DATASET).
shapiro.test (  data_for_alpha[which(data_for_alpha$Source_category == "source1"), "diversity"] ) # NOT ND
shapiro.test (  data_for_alpha[which(data_for_alpha$Source_category == "source2"), "diversity"] ) # ND
shapiro.test (  data_for_alpha[which(data_for_alpha$Source_category == "source3"), "diversity"] ) # ND
shapiro.test (  data_for_alpha[which(data_for_alpha$Source_category == "source4"), "diversity"] ) # NOT ND
shapiro.test (  data_for_alpha[which(data_for_alpha$Source_category == "source5"), "diversity"] ) # NOT ND
shapiro.test (  data_for_alpha[which(data_for_alpha$Source_category == "source6"), "diversity"] ) # ND
# source 7 is too small sample size to perform the test  
shapiro.test (  data_for_alpha[which(data_for_alpha$Source_category == "source8"), "diversity"] ) # ND
shapiro.test (  data_for_alpha[which(data_for_alpha$Source_category == "Sink"), "diversity"] ) # ND

# A: would be better to do a non-parametric test


## 2. assumption of ANOVA: Homogeneity of variance: 
## Variance in the populations compared should be the same/similar. 
bartlett.test ( diversity ~ Source_category, data= data_for_alpha )
# assumption met


# 3. test
kruskal.test (diversity ~ Source_category, data= data_for_alpha) 
# 4. post-hoc
FSA::dunnTest(diversity ~ Source_category, data= data_for_alpha)
```

Even though one assumption is met (hom.var.) we should not use ANOVA - the assumption of homogeneity of variance is an extension of the assumption of normality. It would not be feasible to compare a skewed distribution in one group to a normal distribution in another group, even if they have comparable variances!

Kruskal-Wallis test confirms differences in shannon's index. We do post-hoc diagnosing to see which groups are different with a Dunn test. The difference which is statistically sound is the diversity levels between the source 3 samples and the Sink, after adjusting the p-value.

Since source 3 consists of soil samples, it is not unexpected that it would have higher diversity.


# Beta diversity
```{r, results='hide', warning=FALSE}
bc_diss <- vegdist(t(asv_table), method = "bray")
nmds <- metaMDS(bc_diss)


# make palette for plotting
pal = c("green4", "navy", "orange", "brown", "blue", "lightblue")


# lets plot an nmds where the color is sample type, shape is source sink
as.data.frame(nmds$points) %>% 
  cbind(data_for_alpha$Source_category, metadata_type$type_III) %>% 
  rename("source_cat" = "data_for_alpha$Source_category",
         "type" = "metadata_type$type_III") %>% 
  mutate(source_cat = case_when(source_cat == "Sink" ~ "sink",
                                TRUE ~ "source")) ->  data_for_nmds
  
# plot 
data_for_nmds %>%   
  ggplot(aes(x = MDS1, y = MDS2)) +
  geom_point(aes(fill = type, color = type, 
                 shape= source_cat), 
             size = 5, stroke=1, alpha = 0.6) +
# customizing the plot  
  scale_fill_manual(values =  pal) + 
  scale_color_manual(values = pal) +
  scale_shape_manual(values = c(21,23)) +
  
  theme_linedraw() +
  annotate(geom="text",x=-1.4, y=1.5, label="stress =", color="black") +   
  annotate(geom="text", x=-1.1, y=1.5, label=round(nmds$stress,4),
           color="black") +
  
  #stat_ellipse(aes(colour = type), linewidth = 0.3) + 
  #stat_ellipse(aes(fill = type), geom="polygon",level=0.95,alpha=0.04) +
  
  theme(legend.background = element_blank(),
        legend.box.background = element_rect(colour = "gray")) +
  labs(color = "Water \ntype", shape = "Source") +
  guides(fill = FALSE) +
  geom_vline(xintercept = c(0), color = "grey70", linetype = 2) +
  geom_hline(yintercept = c(0), color = "grey70", linetype = 2) -> plot_nmds
  



```

There was clear grouping of soil and sediment communities. 
Surface water, spring water and biofilm communities were more spread out, and more dissimilar to soil than to sediment
communities.

Out of five sinks, four water and one sediment sink, the most similar to sediment source samples, in terms of community assembly, was the sediment sink. It is also closer to the soil sources, whereas water sinks communities showed little similarity to soil sources.

```{r}
plot_nmds
```

## Test community differences
```{r, warning=FALSE, results='hide', message=FALSE}
# test differences in group centroid with adonis (anova version for multivariate statistics)
permanova <- adonis2(t(asv_table) ~ data_for_nmds$type, 
                    method = "bray" , permutations = 999)
permanova # 0.001
pair_res <- pairwiseAdonis::pairwise.adonis2(t(asv_table) ~ type, data = data_for_nmds)
# Spring_vs_Surface, GW_vs_Spring are only one not significantly distinct!!


# test dispersion as an assumption (variance test for multivariate statitics)
disp <- betadisper( vegdist(  t(asv_table), method = "bray" ), data_for_nmds$type) 
anova(disp)
TukeyHSD(disp) 
# Surface-GW have diff dispersions, does not interfere with centroid differences

 
```


There is signifiicant difference in communities of different sample type (***).
Biofilm, sediment and soil are distinctly different from each other and from surface water, groundwater and spring.
Surface water, groundwater and spring are not easily identifiable between themselves.


# Environmental data

## pca (environmenta variables of water communities)
```{r, warning=FALSE}
df_pca <- metadata %>% 
  # we only have measurements for water, not for soil nor sediment 
  filter(type_I == "Water") %>% 
  select(where(is.numeric)) %>% 
  select(-c("filter", # dont need it
            "N_NO2", "P_PO4")) # NAs
 

# remove the highly correlated ones, like internalATP and totalATP
cor_matrix <- cor(df_pca)
highly_correlated <- caret::findCorrelation(cor_matrix, cutoff = 0.9)
df_pca <- df_pca[, -highly_correlated]

# now there is 22 observations and 10 environmental variables
# lets see if any env vars can be packed neatly in PCs
pca_env <- prcomp(df_pca, scale = TRUE)
biplot(pca_env)

# i believe it is beneficial to remove the sample PX_gw since it is a single outlier in almost every variable, and stresses the ordination
pca_env <- prcomp(df_pca[rownames(df_pca) != "PX_gw",], scale = TRUE)


# we need the column for coloring by type, get it from the original metadata
 metadata[rownames(df_pca) != "PX_gw", c(colnames(df_pca), "type_III", "type_I")] %>% 
   # remove the outlier sample, the columns that are highly correlated and soil/sed samples
  filter(type_I == "Water") -> data_for_pca

autoplot(pca_env, data =data_for_pca , 
         colour = 'type_III',
         size=5, loadings=TRUE, 
         loadings.colour="black", 
         loadings.label=TRUE, loadings.label.size=5, 
         loadings.label.colour="black" )+ 
  geom_vline(xintercept = c(0), color = "grey70", linetype = 2) +
  geom_hline(yintercept = c(0), color = "grey70", linetype = 2) +
  geom_point(shape= 21, color = "black", size = 5) + 
  theme_linedraw() +
  geom_vline(xintercept = c(0), color = "grey70", linetype = 2) +
  geom_hline(yintercept = c(0), color = "grey70", linetype = 2) +
  scale_color_brewer(palette = "Accent")+
  theme(legend.background = element_blank(),
        legend.box.background = element_rect(colour = "gray"))  -> pca_plot

pca_plot
```





The first three components have 1+ variance explanation. See more detailed what PC1, PC2 and PC3 (as combinations of environmental variables) consist of.


normal data: variance = (st.dev)^2 pca data: eigenvalue = (st.dev)^2
“eigenvalues represent the variance explained by each principal component” -> eigen ~ variance


```{r, warning=FALSE, results='hold'}

# Extract the loadings from the PCA result and convert to tidy format 
loadings_df <-   tidy(pca_env, matrix = "rotation")


# PC1
# Create the ggplot2 object for plotting the loadings
loadings_df %>% filter(PC == "1") %>% 
  #x=reorder(class,-amount,sum)
ggplot( aes(x = reorder(column, -value), y = value, group = factor(column))) +
  geom_bar(stat = "identity", position = "dodge", 
           fill= "gray90", color = "gray20") +
  geom_text(aes(label = column), position = position_dodge(width = 0.9), 
            #vjust = "inward",  
            angle = 90 , hjust = "inward") +
  labs(x = "Variables of PC1", y = "Loadings") +
  theme_minimal() +
  theme(axis.text.x = element_blank(),
        legend.position = "none")  -> plot_pca1

# PC2 
loadings_df %>% filter(PC == "2") %>% 
  #x=reorder(class,-amount,sum)
ggplot( aes(x = reorder(column, -value), y = value, group = factor(column))) +
  geom_bar(stat = "identity", position = "dodge", 
           fill= "gray90", color = "gray20") +
  geom_text(aes(label = column), position = position_dodge(width = 0.9), 
            #vjust = "inward",  
            angle = 90 , hjust = "inward") +
  labs(x = "Variables of PC2", y = "Loadings") +
  theme_minimal() +
  theme(axis.text.x = element_blank(),
        legend.position = "none") -> plot_pca2

# PC3
loadings_df %>% filter(PC == "3") %>% 
  #x=reorder(class,-amount,sum)
ggplot( aes(x = reorder(column, -value), y = value, group = factor(column))) +
  geom_bar(stat = "identity", position = "dodge", 
           fill= "gray90", color = "gray20") +
  geom_text(aes(label = column), position = position_dodge(width = 0.9), 
            #vjust = "inward",  
            angle = 90 , hjust = "inward") +
  labs(x = "Variables of PC3", y = "Loadings") +
  theme_minimal() +
  theme(axis.text.x = element_blank(),
        legend.position = "none") -> plot_pca3

plot_pca1 + plot_pca2 + plot_pca3
```


Test pca dispersions.
```{r, warning=FALSE}
# we can check the significance in difference between groups in both pca and nmds

euclidean_dist <- data_for_pca %>% select(-c("type_I" , "type_III")) %>% 
                  scale() %>% as.data.frame()  %>% 
                  dist(., method = "euclidean") 

disp_pca <- betadisper( euclidean_dist, data_for_pca$type_III )
anova(disp_pca) #no sign. differences 


disp_pca$distances %>% as.data.frame() %>% 
  rename("Distance" = ".") %>% 
  rownames_to_column("sample_id") %>% 
  cbind(data_for_pca) %>% 
# plot
  ggplot( aes(x = type_III, y = Distance)) +
  geom_boxplot( fill = "violet") +
  labs(x = NULL, y = "Environmental differences \nstandardized Euclidean distance") +
  theme_linedraw() +
  theme(axis.text.x = element_text(size = 10),
        axis.title.y = element_text(size = 10)) -> pca_disp

```

## nmds (water communities)
```{r, warning=FALSE, results='hide'}

asv_water <- asv_table [,rownames(df_pca)] %>% select(!"PX_gw")
bc_diss_water <- vegdist(t(asv_water), method = "bray")
nmds_water <- metaMDS(bc_diss_water)


```

```{r, warning=FALSE, results='hold'}

as.data.frame(nmds_water$points) %>% 
  cbind(data_for_pca$type_III) %>% 
  rename("type" = "data_for_pca$type_III") %>% 
  
# plot 
  ggplot(aes(x = MDS1, y = MDS2)) +
  geom_point(aes(fill = type), shape = 21, 
             size = 5, color= "black" ) +
# customizing the plot  
  scale_fill_brewer(palette = "Accent") +
  theme_linedraw() +
  annotate(geom="text",x=-1.4, y=1.5, label="stress =", color="black") +   
  annotate(geom="text", x=-1.1, y=1.5, label=round(nmds_water$stress,4),
           color="black") +
  
  theme(legend.background = element_blank(),
        legend.box.background = element_rect(colour = "gray")) +
  labs(color = "Water \ntype", shape = "Source") +
  guides(fill = FALSE) +
  geom_vline(xintercept = c(0), color = "grey70", linetype = 2) +
  geom_hline(yintercept = c(0), color = "grey70", linetype = 2) -> nmds_plot_water
  


euclidean_dist <- data_for_pca %>% select(-c("type_I" , "type_III")) %>% 
                  scale() %>% as.data.frame()  %>% 
                  dist(., method = "euclidean") 

disp_nmds <- betadisper( bc_diss_water, data_for_pca$type_III )
anova(disp_nmds) # sign. differences 


disp_nmds$distances %>% as.data.frame() %>% 
  rename("Distance" = ".") %>% 
  rownames_to_column("sample_id") %>% 
  cbind(data_for_pca) %>% 
# plot
  ggplot( aes(x = type_III, y = Distance)) +
  geom_boxplot( fill = "violet") +
  labs(x = NULL, y = "Bray-curtis distance") +
  theme_linedraw() +
  theme(axis.text.x = element_text(size = 10),
        axis.title.y = element_text(size = 10)) -> nmds_water_disp


 nmds_plot_water  + pca_plot +
  nmds_water_disp + pca_disp 
```


Lake communities and groundwater communties seem to have relatively distinct microbial composition.
The hyporheic, pond and inflow water consist only of one sample point, when we remove the soil and sediment samples for this analysis, so it is not possible to draw conclusions. But the rest of the samples have a similar relation for both the environmental variaton and the community variation - environmental homogeneity of the groundwater samples in visible in the low dispersion of the groundwater microbial communities. (check significance and put on plot later) 

## Constrained ordiantion

CAP ordination as an distance-based alternative to the RDA (which uses bray, as opposed to RDA's euclidean).
Hellinger transformation (square root of relative abundances) is used to account for compositionality of data. 
```{r, warning=FALSE, results='hold'}

data_for_pca %>% select(where(is.numeric)) %>% 
  scale() %>% as.data.frame() -> env.data

decostand(t(asv_water), method = "hellinger") -> sp.dat

# check
identical(rownames(env.data), colnames(asv_water)) #TRUE

# perform model
simpleRDA <- capscale(sp.dat ~  ., data=env.data ,
                 distance = "bray")

vif.cca(simpleRDA) # all variables lower than 10


# test

# test env paameters
anova.model3 <- anova(simpleRDA, step=1000, by = "term")
anova.model3
# sign: bix* , fi***, TCC***, Cond, O2**,  Temp*

# new model 
simpleRDA <- capscale(sp.dat ~  bix + fi + TCC + Cond + O2 + Temp, data=env.data ,
                 distance = "bray")
# test env paameters
anova.model3 <- anova(simpleRDA, step=1000, by = "term")

# test of all canonical axes
anova.model <- anova.cca(simpleRDA, by='axis', step=1000)  # CAP1,CAP2,CAP3 significant 

# test the whole model
anova.model2 <- anova.cca(simpleRDA, step=1000) # sign
RsquareAdj(simpleRDA)$adj.r.squared # explains 30%

```

We can say that bix and fix, as well as the temeprature, oxygen and conductivity have an impact on the microbial communities of Luny water bodies. Additionally, the number of TCC (microbial load) in the water body is also a significant indicator of the water community composition.



Plot the constrained ordination if we want to but it could be a bit redundadnt considering the individual ordinations of environmental variables and communitiies. On the other hand, the visual depiction implies need for more sample points, considering how sparse the ordination is.
```{r, warning=FALSE, results='hold'}
# vectors
ccavectors <- as.matrix(scores(simpleRDA, display = "bp", scaling = "sites")*3.32) %>% 
  t() %>% as.data.frame() %>% 
  rename("bix*" = "bix",
         "Temp*" = "Temp",
         "O2**" = "O2",
         "fi***" = "fi",
         "TCC***" = "TCC") %>%
  t() %>% as.data.frame()

#bix* , fi***, TCC***, Cond, O2**,  Temp*


# site coordinates
site_data <- scores(simpleRDA, display = "sites") %>% 
  as.data.frame() %>% 
  cbind(., data_for_pca$type_III) %>% 
  rename("type" = "data_for_pca$type_III")

# plotting
plot_cca <- 
  site_data %>% 
  ggplot( aes(x = CAP1, y = CAP2)) +
  geom_point(aes(fill = type), shape = 21, 
             size = 5, color= "black" )+  
  geom_segment(data = ccavectors, aes(x = 0, y = 0, xend = CAP1, yend = CAP2), 
               size = 1.2, arrow = arrow(length = unit(0.5, "cm"))) +
   geom_vline(xintercept = c(0), color = "grey70", linetype = 2) +
   geom_hline(yintercept = c(0), color = "grey70", linetype = 2) +
  geom_text(data = ccavectors, aes(x = CAP1*1.2, y = CAP2*1.1, 
                                   label = rownames(ccavectors)),
                                   size=6 ) +
  theme_linedraw() +
  scale_fill_brewer(palette = "Accent") +
     theme(legend.background = element_blank(),
           legend.box.background = element_rect(colour = "gray")) 

plot_cca
```

There is certain indication that microbial community assembly of Lunz lakes water samples and near water bodies is not impacted by the environmental variables which were measured in this study. However, the sample type provided a clearer impact on community
assembly. Samples of type sediment, soil and surface water showed community similarity within the type category. To determine whether sample type is merely a proxy for ecological conditions, environmental variables for samples other than water samples need to be measured, which was not the case in this study. 

Contribution of sources to sinks was not impacted by spatial distance, and the artificial lake provided no physical barrier. 
Contribution by sample type provided more accurate insight than contribution by “Source” category, which was an arbitrary category determined as a mix of sample type and distance from the sink. 

This was shown with Source 6 contribution, consisting of spring sediment and water. Sample type analysis showed that spring sediment of Source 6 contributed to sink sediments, and Source 6 water samples contributed to water sinks.

In summary, this study provides a starting point in community assembly research of microbial fauna of Lunz lakes catchment, and gives indication that future studies should be focused on environmental conditions of each microenvironment, and their ties to microbial
assembly, since categorising samples by distance or similarity does not give clear results.
