Soil Sample project

Metagenomic observations from Soil Microbial Communities: Functional Genes, Adaptations, Species Distributions, and Interactions.

Introduction- According to Libretexts (2022), soil formation involves 45% inorganic minerals, mostly sand, silt, and clay, allowing elements such as ammonia, calcium, and zinc to be available. The organic matter in soil constitutes 5% (Libretexts, 2022) of macroorganisms and microorganisms, which may include dead or decaying matter. These are crucial for completing the ecosystem with continuous nutrient cycles such as phosphorus, carbon, and nitrogen (Bormann & Likens, 1970). Soil also contains 25% water (Libretexts, 2022), which provides essential hydrocarbons for cellular processes for macro and microorganisms, and 25% air (Libretexts, 2022), which is ideal for aerobic organisms.The diversity of microbial populations in soil fluctuates and adapts quickly due to the quick reproduction of generations, DNA mutations and genome or epigenetic modifications (Elena & Lenski, 2003) due to changes in PH, temperature, and nutrient availability in soil. (Daniel, 2005). These environmental factors affect the microbial’s functional genes, adaptations, species distributions, interactions, and evolutionary relationships (Elena & Lenski, 2003). These complex microscopic communities’ genotypes and phenotypes can be observed and compared with other abundant microbial communities in various habitats (Fierer & Jackson, 2006). Originating from discovering microorganisms under a microscope (Gest, 2004) by Robert Hooke and Antoni Van Leeuwenhoek. The continuous modern advances of inspection and deeper understanding of the biodiversity of prokaryotic species today uncover their function, purpose and possibilities, as well as resemblance from DNA and RNA sequencing and data analysis. (Burke et al., 2011). In particular, the “Third-generation sequencing” (Petersen et al., 2019) Oxford Nanopore sequencing 16S Barcoding Kit is specifically designed to target the rRNA gene, allowing for identifying bacterial communities. The technology relies on DNA strands passing through nanopores, altering electrical currents to determine base sequences (Petersen et al., 2019). Unlike Next Generation Sequencing, where DNA fragments are short-reads (Soliman et al., 2017), nanopore technology can sequence long-reads, generating additional DNA sequences for further analysis; this makes it an optimal choice for observing microorganisms in soil samples. This study investigates and compares soil samples from three locations: the lake (on campus), the Wivenhoe trail (outside campus), and the garden (outside campus). To explore what species are present in soil samples in the various locations, critical functional genes in the microorganisms found in soil samples and any unique DNA adaptations or traits in the samples that evolved in response to habitat conditions. In addition, metagenomic analysis of soil DNA provides insights into community interactions, for example, ecological competition or symbiosis. Hypothesis 1: The study aims to show that the conserved functional genes in microorganisms found in soil differ significantly in different environmental habitats.Hypothesis 2: Microorganisms in soil samples show unique genetic adaptations and traits within different locations and environmental conditions.

install.packages("readr")

## Installing package into '/home/catherinetaylor35/R/x86_64-pc-linux-gnu-library/4.0'
## (as 'lib' is unspecified)

library("readr")

data = read_tsv("data.tsv")

## Rows: 3432 Columns: 27
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr  (1): tax
## dbl (26): barcode01, barcode02, barcode03, barcode04, barcode05, barcode06, ...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(data)

## # A tibble: 6 × 27
##   tax      barcode01 barcode02 barcode03 barcode04 barcode05 barcode06 barcode07
##   <chr>        <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
## 1 Bacteri…     28213       345     21467     62888       664         0         0
## 2 Bacteri…      6073        52     36614     87607       168         0         0
## 3 Bacteri…      7734        31     16041     44558       302         0         0
## 4 Bacteri…      8797        92      8632     43361       269         0         0
## 5 Bacteri…      7951        31     13659     31858        85         0         0
## 6 Bacteri…     39851       113      1628      2193       199         0         0
## # ℹ 19 more variables: barcode08 <dbl>, barcode09 <dbl>, barcode10 <dbl>,
## #   barcode11 <dbl>, barcode12 <dbl>, barcode13 <dbl>, barcode14 <dbl>,
## #   barcode15 <dbl>, barcode16 <dbl>, barcode17 <dbl>, barcode18 <dbl>,
## #   barcode19 <dbl>, barcode20 <dbl>, barcode21 <dbl>, barcode22 <dbl>,
## #   barcode23 <dbl>, barcode24 <dbl>, unclassified <dbl>, total <dbl>

# Install and load dplyr package
install.packages("dplyr")

## Installing package into '/home/catherinetaylor35/R/x86_64-pc-linux-gnu-library/4.0'
## (as 'lib' is unspecified)

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

# Select specific columns
df_cleaned <- data %>%
  select(`tax`, `barcode01`, `barcode02`, `barcode03`, `barcode04`, `barcode05`, `barcode10`, unclassified, total)

# View the cleaned dataframe
head(df_cleaned)

## # A tibble: 6 × 9
##   tax   barcode01 barcode02 barcode03 barcode04 barcode05 barcode10 unclassified
##   <chr>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>        <dbl>
## 1 Bact…     28213       345     21467     62888       664     23479         3543
## 2 Bact…      6073        52     36614     87607       168      1646         3789
## 3 Bact…      7734        31     16041     44558       302      6639         2094
## 4 Bact…      8797        92      8632     43361       269      1264         1666
## 5 Bact…      7951        31     13659     31858        85      1043         1631
## 6 Bact…     39851       113      1628      2193       199      5357         1170
## # ℹ 1 more variable: total <dbl>

# Aggregate data by species
species_aggregate <- df_cleaned %>%
  group_by(tax) %>%
  summarise(across(starts_with("barcode"), sum, na.rm = TRUE)) # Summing up all barcode columns for each species

## Warning: There was 1 warning in `summarise()`.
## ℹ In argument: `across(starts_with("barcode"), sum, na.rm = TRUE)`.
## ℹ In group 1: `tax =
##   "Archaea;Archaea_none;Euryarchaeota;Halobacteria;Halobacteriales;Haloarculaceae;Haloarcula"`.
## Caused by warning:
## ! The `...` argument of `across()` is deprecated as of dplyr 1.1.0.
## Supply arguments directly to `.fns` through an anonymous function instead.
## 
##   # Previously
##   across(a:b, mean, na.rm = TRUE)
## 
##   # Now
##   across(a:b, \(x) mean(x, na.rm = TRUE))

# View the aggregated data
head(species_aggregate)

## # A tibble: 6 × 7
##   tax                barcode01 barcode02 barcode03 barcode04 barcode05 barcode10
##   <chr>                  <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
## 1 Archaea;Archaea_n…         1         0         0         0         0         0
## 2 Bacteria;Bacteria…        91        12       145        66         4        28
## 3 Bacteria;Bacteria…       529        22      2299      1237        22        35
## 4 Bacteria;Bacteria…       124         5       767       236         2         5
## 5 Bacteria;Bacteria…       171         2       654       848         4         8
## 6 Bacteria;Bacteria…       350         9      2456      1023         6        15

# Install the phyloseq package
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

## Bioconductor version '3.12' is out-of-date; the current release version '3.18'
##   is available with R version '4.3'; see https://bioconductor.org/install

BiocManager::install("phyloseq")

## 'getOption("repos")' replaces Bioconductor standard repositories, see
## 'help("repositories", package = "BiocManager")' for details.
## Replacement repositories:
##     CRAN: https://cloud.r-project.org

## Bioconductor version 3.12 (BiocManager 1.30.22), R 4.0.4 (2021-02-15)

## Warning: package(s) not installed when version(s) same as or greater than current; use
##   `force = TRUE` to re-install: 'phyloseq'

## Installation paths not writeable, unable to update packages
##   path: /usr/lib/R/library
##   packages:
##     boot, class, cluster, codetools, foreign, KernSmooth, lattice, MASS,
##     Matrix, mgcv, nlme, nnet, rpart, spatial, survival

install.packages("lattice")

## Installing package into '/home/catherinetaylor35/R/x86_64-pc-linux-gnu-library/4.0'
## (as 'lib' is unspecified)

library(phyloseq)
library(lattice)
install.packages(c("ggplot2", "dplyr", "tidyr", "readr", "purrr", "tibble", "stringr", "forcats", "magrittr" ))

## Installing packages into '/home/catherinetaylor35/R/x86_64-pc-linux-gnu-library/4.0'
## (as 'lib' is unspecified)

library(magrittr)
library(tidyr)

## 
## Attaching package: 'tidyr'

## The following object is masked from 'package:magrittr':
## 
##     extract

install.packages("magrittr")

## Installing package into '/home/catherinetaylor35/R/x86_64-pc-linux-gnu-library/4.0'
## (as 'lib' is unspecified)

library(magrittr)
install.packages("tidyr")

## Installing package into '/home/catherinetaylor35/R/x86_64-pc-linux-gnu-library/4.0'
## (as 'lib' is unspecified)

library(tidyr)

df_cleaned <- species_aggregate %>% separate(tax, sep = ";", into = c("Domain", "Kingdom", "Phylum", "Class", "Order", "Family", "Genus"))
head(df_cleaned)

## # A tibble: 6 × 13
##   Domain   Kingdom Phylum Class Order Family Genus barcode01 barcode02 barcode03
##   <chr>    <chr>   <chr>  <chr> <chr> <chr>  <chr>     <dbl>     <dbl>     <dbl>
## 1 Archaea  Archae… Eurya… Halo… Halo… Haloa… Halo…         1         0         0
## 2 Bacteria Bacter… Abdit… Abdi… Abdi… Abiti… Abdi…        91        12       145
## 3 Bacteria Bacter… Acido… Acid… Acid… Acido… Acid…       529        22      2299
## 4 Bacteria Bacter… Acido… Acid… Acid… Acido… Acid…       124         5       767
## 5 Bacteria Bacter… Acido… Acid… Acid… Acido… Acid…       171         2       654
## 6 Bacteria Bacter… Acido… Acid… Acid… Acido… Acid…       350         9      2456
## # ℹ 3 more variables: barcode04 <dbl>, barcode05 <dbl>, barcode10 <dbl>

tax_data <- df_cleaned %>% select(c("Phylum", "Class", "Order", "Family", "Genus"))

tax_ranks <- colnames(tax_data)

tax_data <- tax_table(tax_data)

## Warning in .local(object): Coercing from data.frame class to character matrix 
## prior to building taxonomyTable. 
## This could introduce artifacts. 
## Check your taxonomyTable, or coerce to matrix manually.

dimnames(tax_data)[[2]] <- tax_ranks

head(tax_data)

## Taxonomy Table:     [6 taxa by 5 taxonomic ranks]:
##     Phylum             Class            Order               Family             
## sp1 "Euryarchaeota"    "Halobacteria"   "Halobacteriales"   "Haloarculaceae"   
## sp2 "Abditibacteriota" "Abditibacteria" "Abditibacteriales" "Abitibacteriaceae"
## sp3 "Acidobacteria"    "Acidobacteriia" "Acidobacteriales"  "Acidobacteriaceae"
## sp4 "Acidobacteria"    "Acidobacteriia" "Acidobacteriales"  "Acidobacteriaceae"
## sp5 "Acidobacteria"    "Acidobacteriia" "Acidobacteriales"  "Acidobacteriaceae"
## sp6 "Acidobacteria"    "Acidobacteriia" "Acidobacteriales"  "Acidobacteriaceae"
##     Genus            
## sp1 "Haloarcula"     
## sp2 "Abditibacterium"
## sp3 "Acidicapsa"     
## sp4 "Acidipila"      
## sp5 "Acidisarcina"   
## sp6 "Acidobacterium"

# first, extract all columns starting with 'barcode'
otu_tab <- df_cleaned %>% select(starts_with("barcode"))

# now we convert it to an otuTable object
# the 'taxa_are_rows' argument tells phyloseq which orientation our data are in
otu_tab <- otu_table(otu_tab, taxa_are_rows = T)

head(otu_tab)

## OTU Table:          [6 taxa and 6 samples]
##                      taxa are rows
##     barcode01 barcode02 barcode03 barcode04 barcode05 barcode10
## sp1         1         0         0         0         0         0
## sp2        91        12       145        66         4        28
## sp3       529        22      2299      1237        22        35
## sp4       124         5       767       236         2         5
## sp5       171         2       654       848         4         8
## sp6       350         9      2456      1023         6        15

microbiome <- phyloseq(otu_tab, tax_data)

microbiome # prints a nice summary of the data

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 3432 taxa and 6 samples ]
## tax_table()   Taxonomy Table:    [ 3432 taxa by 5 taxonomic ranks ]

head(otu_table(microbiome))

## OTU Table:          [6 taxa and 6 samples]
##                      taxa are rows
##     barcode01 barcode02 barcode03 barcode04 barcode05 barcode10
## sp1         1         0         0         0         0         0
## sp2        91        12       145        66         4        28
## sp3       529        22      2299      1237        22        35
## sp4       124         5       767       236         2         5
## sp5       171         2       654       848         4         8
## sp6       350         9      2456      1023         6        15

head(tax_table(microbiome))

## Taxonomy Table:     [6 taxa by 5 taxonomic ranks]:
##     Phylum             Class            Order               Family             
## sp1 "Euryarchaeota"    "Halobacteria"   "Halobacteriales"   "Haloarculaceae"   
## sp2 "Abditibacteriota" "Abditibacteria" "Abditibacteriales" "Abitibacteriaceae"
## sp3 "Acidobacteria"    "Acidobacteriia" "Acidobacteriales"  "Acidobacteriaceae"
## sp4 "Acidobacteria"    "Acidobacteriia" "Acidobacteriales"  "Acidobacteriaceae"
## sp5 "Acidobacteria"    "Acidobacteriia" "Acidobacteriales"  "Acidobacteriaceae"
## sp6 "Acidobacteria"    "Acidobacteriia" "Acidobacteriales"  "Acidobacteriaceae"
##     Genus            
## sp1 "Haloarcula"     
## sp2 "Abditibacterium"
## sp3 "Acidicapsa"     
## sp4 "Acidipila"      
## sp5 "Acidisarcina"   
## sp6 "Acidobacterium"

# Can change which taxonomic rank the plot shows
plot_bar(microbiome, fill = "Phylum")

Presented here is a visual representation in the form of a stacked bar chart showcasing the prevalence of different bacterial phyla across multiple samples. Each bar is color-coded to denote a specific phylum, and the length of each color segment indicates its abundance in the respective sample. This informative tool aids in comprehending the diversity and relative prevalence of various bacteria in each sample, while the array of colors within each bar offers valuable insights into the intricacies of the microbial ecosystem. This knowledge is crucial for ecological and biological studies where accurate understanding of the makeup of microbial communities is paramount.

sample_sums(microbiome) # shows the number of sequences in each sample

## barcode01 barcode02 barcode03 barcode04 barcode05 barcode10 
##    408376    395805   1160754    881449     12167    362660

rarefiedData <- rarefy_even_depth(microbiome,
    sample.size = min(sample_sums(microbiome)), replace = F)

## You set `rngseed` to FALSE. Make sure you've set & recorded
##  the random seed of your session for reproducibility.
## See `?set.seed`

## ...

## 1514OTUs were removed because they are no longer 
## present in any sample after random subsampling

## ...

# Calculate species richness of ACE and Shannon
richness_data <- estimate_richness(rarefiedData, measures = c("ACE", "Shannon"))

# View the first few lines of the diversity data
head(richness_data)

##                ACE   se.ACE  Shannon
## barcode01 1294.690 19.74187 5.144367
## barcode02 1486.555 21.24376 5.181883
## barcode03 1253.298 18.41328 5.695419
## barcode04 1058.373 17.97829 4.839677
## barcode05 1071.254 17.39766 5.250453
## barcode10 1163.166 18.24988 5.367947

# Load the ggplot2 package
library(ggplot2)

# Create a data frame with your data
data <- data.frame(
  Barcode = c("barcode01", "barcode02", "barcode03", "barcode04", "barcode05", "barcode10"),
  ACE = c(1240.594, 1538.493, 1251.353, 1055.012, 1071.254, 1130.701),
  se.ACE = c(18.95278, 22.25332, 18.53425, 17.85359, 17.39766, 17.79527),
  Shannon = c(5.147862, 5.196808, 5.693757, 4.823278, 5.250453, 5.345040)
)

# Calculate the upper and lower limits of the error bars for ACE
data$ACE_upper <- data$ACE + data$se.ACE
data$ACE_lower <- data$ACE - data$se.ACE

# Create the plot with error bars for ACE
ggplot(data, aes(x = Barcode)) +
  geom_errorbar(aes(ymin = ACE_lower, ymax = ACE_upper), width = 0.2) +
  geom_point(aes(y = ACE), color = 'red', size = 3) +
  geom_line(aes(y = ACE, group = 1), color = 'red') +
  geom_point(aes(y = Shannon * 100), color = 'blue', size = 3) + # Multiplying by 100 for visualization purposes
  geom_line(aes(y = Shannon * 100, group = 1), color = 'blue') +
  theme_minimal() +
  labs(title = "ACE and Shannon across Barcodes", x = "Barcode", y = "Value") +
  scale_y_continuous(sec.axis = sec_axis(~ . / 100, name = "Shannon")) # Adding a secondary axis for Shannon

A line plot shows the variation of two alpha diversity measures, the ACE Index and the Shannon Index, across different samples labelled as barcodes. The ACE Index estimates species richness, while the Shannon Index reflects the species’ abundance and evenness. The x-axis shows the sample barcodes and the y-axis displays the diversity indices. The error bars on the ACE Index indicate the variability of the estimates. The peak at barcode02 on the ACE Index line implies a higher estimated number of species there.

# Use ggplot2 to create the plot with ACE and Shannon
ggplot(data, aes(x = Barcode)) +
  # Add ACE points
  geom_point(aes(y = ACE), color = "black") +
  # Add error bars for ACE
  geom_errorbar(aes(ymin = ACE_lower, ymax = ACE_upper), width = 0.2) +
  # Add Shannon points
  geom_point(aes(y = Shannon * 200), color = "blue", shape = 17) + # Multiplying Shannon by a factor to scale appropriately
  # Customizations
  theme_minimal() +
  labs(title = "Alpha Diversity Measures for Samples", x = "Barcode", y = "ACE") +
  # Adjust the secondary axis for the Shannon index
  scale_y_continuous(
    name = "ACE",
    sec.axis = sec_axis(~ . / 200, name="Shannon") # Adjust this transformation to match the scale of Shannon
  )

# Display the plot
ggsave("alpha_diversity_plot.png", width = 10, height = 6, dpi = 300)

The graph presents a comparative analysis of alpha diversity within samples using two metrics: the ACE Index and the Shannon Index. The ACE Index, represented by black error bars, estimates species richness with a focus on rare species, while the Shannon Index, indicated by blue triangles, evaluates species’ abundance and evenness. The x-axis labels ‘barcode01’ through ‘barcode10’ likely denote individual samples or sequences, and the dual y-axes allow for a direct comparison between the two indices. This visualization aids in discerning the biodiversity of each sample, where higher values on both indices indicate greater diversity. The error bars capture variability in the ACE Index, which provides insight into the confidence of these estimations.

# First, create the data frame with your data
data <- data.frame(
  Barcode = c("barcode01", "barcode02", "barcode03", "barcode04", "barcode05", "barcode10"),
  ACE = c(1240.594, 1538.493, 1251.353, 1055.012, 1071.254, 1130.701),
  Shannon = c(5.147862, 5.196808, 5.693757, 4.823278, 5.250453, 5.345040)
)

# Descriptive statistics for ACE
ace_mean <- mean(data$ACE)
ace_sd <- sd(data$ACE)
ace_se <- ace_sd / sqrt(length(data$ACE))

# Descriptive statistics for Shannon
shannon_mean <- mean(data$Shannon)
shannon_sd <- sd(data$Shannon)
shannon_se <- shannon_sd / sqrt(length(data$Shannon))

# Output the descriptive statistics
cat("Descriptive Statistics for ACE:\n")

## Descriptive Statistics for ACE:

cat("Mean:", ace_mean, "SD:", ace_sd, "SE:", ace_se, "\n\n")

## Mean: 1214.568 SD: 178.8791 SE: 73.02707

cat("Descriptive Statistics for Shannon:\n")

## Descriptive Statistics for Shannon:

cat("Mean:", shannon_mean, "SD:", shannon_sd, "SE:", shannon_se, "\n")

## Mean: 5.242866 SD: 0.2831964 SE: 0.1156144

nmdsResult <- ordinate(rarefiedData, "NMDS", "bray")

## Square root transformation
## Wisconsin double standardization
## Run 0 stress 0.005619765 
## Run 1 stress 0.002148052 
## ... New best solution
## ... Procrustes: rmse 0.03755366  max resid 0.0677368 
## Run 2 stress 0.004187829 
## Run 3 stress 0.2761339 
## Run 4 stress 0.002923068 
## Run 5 stress 0.001145372 
## ... New best solution
## ... Procrustes: rmse 0.01144639  max resid 0.02058192 
## Run 6 stress 0.002429543 
## Run 7 stress 0.002722903 
## Run 8 stress 0.00488279 
## Run 9 stress 0.001715008 
## Run 10 stress 0.001243233 
## ... Procrustes: rmse 0.004500633  max resid 0.009126539 
## ... Similar to previous best
## Run 11 stress 0.001265862 
## ... Procrustes: rmse 0.004603063  max resid 0.009135091 
## ... Similar to previous best
## Run 12 stress 9.961051e-05 
## ... New best solution
## ... Procrustes: rmse 0.01077764  max resid 0.02029719 
## Run 13 stress 0.002386339 
## Run 14 stress 0.0002277899 
## ... Procrustes: rmse 0.001327226  max resid 0.002401117 
## ... Similar to previous best
## Run 15 stress 0.0007979903 
## Run 16 stress 0.005543951 
## Run 17 stress 0.001823967 
## Run 18 stress 0.003237181 
## Run 19 stress 0.007731411 
## Run 20 stress 0.0002161793 
## ... Procrustes: rmse 0.001209897  max resid 0.002157458 
## ... Similar to previous best
## *** Best solution repeated 2 times

## Warning in metaMDS(veganifyOTU(physeq), distance, ...): stress is (nearly)
## zero: you may have insufficient data

plot_ordination(rarefiedData, nmdsResult, type = "sample")

## No available covariate data to map on the points for this plot `type`

The scatter plot provided depicts Non-metric Multidimensional Scaling (NMDS), a technique that reduces multidimensional data into two dimensions for visual representation. The two axes, NMDS1 and NMDS2, signify two dimensions of variation based on ranked dissimilarities among data points. Each point on the plot represents an individual sample, with proximity indicating similarity; closer points suggest greater similarity. This plot is useful in identifying clusters or gradients, indicating patterns or groupings within the data. However, detailed interpretation of the plot’s structure is limited without additional annotations or a stress value. The significance of this visualization lies in its ability to simplify complex relationships within high-dimensional data.

Soil Sample project

Catherine Taylor

2023-12-22

Metagenomic observations from Soil Microbial Communities: Functional Genes, Adaptations, Species Distributions, and Interactions.