Introduction
As bacterivores, C. elegans feed on bacteria that grow on rotting vegetation. Historically, a strain of Escherichia coli, OP50, was chosen as the standard laboratory diet. While most experimental studies use OP50 as the primary food source, the number of bacterial diets used to propagate C. elegans has expanded in recent years. Another laboratory E. coli strain that is routinely used is HB101. Although both of these bacterial diets support growth and development, the nutritional composition of each is unique, and when fed to C. elegans, differentially impacts several traits. Research has indicated that animals fed HB101 develop faster and are larger than those fed OP50. Unsurprisingly, these diet-induced growth effects are also accompanied by global changes in gene expression. Previously, researchers have analyzed the transcriptional response of animals fed differing diets (1,2).
Characterizing the expression changes of C. elegans fed different strains of bacteria expands the understanding of diet effects at the molecular level.
- Schumacker et al., 2019. RNA sequencing dataset characterizing transcriptomic responses to dietary changes in Caenorhabditis elegans.
- Stuhr & Curran, 2020. Bacterial diets differentially alter lifespan and health span trajectories in C. elegans
Schumacker workflow - Illumina NextSeq 500
Stuhr workflow - NovaSeq 600
Learning objectives
- Gain general understanding of and experience using different alignment methods (Bowtie/Kallisto)
- Handle processing of FastQ -> Bam -> count matrix
- Identify DEGs using popular packages (DESeq2, sleuth)
- Create pretty visualization
Project questions
- How does overall data from the two labs compare? Do I observe large “batch” effects?
- When using multiple alignment methods, do results vary?
- Do differentially expressed genes (DEGs) differ across the papers? How so? What genes are consistently picked out?
Analysis using Bowtie & DESeq2
Step 1. Trimming, mapping, and indexing
Bowtie: FastQ -> Bam/Bai
The first step of this analysis was taking data generated from a sequencer (FastQ file format), perform trimming, mapping, and indexing to obtain a Bam and Bai file for downstream analysis.
Example bioproject entries from Schumacker 2019
Data from the papers archived through Sequence Read Archive (SRA) is accessible on Quest using sratoolkit. I was able to download all necessary fastq files and compress them on quest
Using a bash script, I was able to run Bowtie to generate Bam/Bai files for all conditions.
Step 2. Bam -> Granges
Given the Bam file produced by running Bowtie, I then used the GRanges package in R to convert this data into a GRanges object.
# load saved GRanges data
gr <- readRDS("data/GrangesList.RDS")
class(gr)
## [1] "CompressedGRangesList"
## attr(,"package")
## [1] "GenomicRanges"
names(gr)
## [1] "schumacker_HB101_rep1" "schumacker_HB101_rep2" "schumacker_OP50_rep1"
## [4] "schumacker_OP50_rep2" "stuhr_HB101_rep1" "stuhr_HB101_rep2"
## [7] "stuhr_HB101_rep3" "stuhr_OP50_rep1" "stuhr_OP50_rep2"
## [10] "stuhr_OP50_rep3"
Generate count matrix
# count over transcripts, grouped by gene
txbygn = transcriptsBy(TxDb.Celegans.UCSC.ce11.ensGene, by = 'gene')
counts.per.transcript = lapply(gr, function(x){
out = countOverlaps(txbygn, x)
return(out)})
counts.matrix = do.call('cbind', counts.per.transcript)
counts.matrix[1:4,1:2]
## schumacker_HB101_rep1 schumacker_HB101_rep2
## WBGene00000001.1 87 95
## WBGene00000002.1 11 6
## WBGene00000003.1 18 25
## WBGene00000004.1 9 6
Step 3. PCA
Perform PCA on data from both groups combined
Performed PCA on all genes using the stats::prcomp() function
OP50 replicate 1 for stuhr group is a HUGE outlier. Will remove this for further analysis
Looks like the difference between the two research groups explains a vast amount of the variation in the data (unsurprisingly). It is likely more appropriate to analyze the data from each group separately.
Split data into research groups and rerun PCA
Question – PC2 seems to describe the difference between replicates for both groups should I take PC2 into account for the DESeq model?
Well, is PC2 significant? Let’s consider the eigenvalues. The “Kaiser-Guttman” criterion for retaining principal components states that you only keep the ones whose eigenvalues are greater than the average eigenvalue.
## integer(0)
## integer(0)
Given this analysis, I can conclude that it is not necessary to account for PC2 when building DESeq models.
Step 4. Run DESeq2 on groups separately
schumacker.dds = DESeqDataSetFromMatrix(countData = counts.matrix[,1:4],
colData = schumacker.meta,
design = ~replicate + diet)
schumacker.dds = DESeq(schumacker.dds, parallel = FALSE)
schumacker.res = results(schumacker.dds, contrast = c("diet", "OP50", "HB101"),
independentFiltering = FALSE)
These plots look good. Let’s look at the associated volcano plots.
Question: do I see enrichment of upregulated genes between the data sets?
I can perform a “gene set” enrichment analysis using the Fisher’s Exact Test. I will have 4 categories: whether a gene is only up in Schumacker or Stuhr, whether a gene is up in both, and whether a gene is up in neither group.
#Does contingency table == size of full data set?
sum(contingency_bowtie) == nrow(schumacker.res)
## [1] TRUE
contingency_bowtie
## Stuhr !Stuhr
## Schumacker 18 151
## !Schumacker 938 45797
I can now run a Fisher’s exact test
ftest <- stats::fisher.test(contingency_bowtie)
ftest$p.value
## [1] 1.364319e-08
log2(ftest$estimate)
## odds ratio
## 2.540901
This odds ratio is greater than 1 indicating enrichment of “up genes” identified by Schumacker in the Stuhr data set.
## Warning: Expected 1 pieces. Additional pieces discarded in 18 rows [1, 2, 3, 4,
## 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18].
Analysis using Kallisto & Sleuth
Although Bowtie is a widely used, fast aligner, there are many other options available. The Andersen Lab uses Kallisto to map reads from RNAseq experiments. Kallisto is different from Bowtie as it gives transcript level expression information. While Bowtie maps reads to a reference genome, Kallisto quantifies transcript abundances using pseudo alignment of reads to a graph transcriptome. For more information about how Kallisto works, see this blog post.
Step 1. FastQ -> count matrix
I was able to use a pipeline developed by the Andersen Lab to perform mapping with Kallisto. Here, instead of generating a Bam/Bai file as output, a count matrix is returned.
- pre trim QC using fastqc
- trim using fastp (instead of TrimGalore)
- generate transcriptome
- kallisto mapping and indexing (instead of Bowtie)
# load kallisto data
dir <- "data/kallisto/tsv"
files <- list.files(dir, full.names = TRUE)
kcounts <- purrr::map_dfr(files, ~readr::read_tsv(.x) %>%
dplyr::mutate(sample = str_remove(basename(.x), ".tsv")) %>%
tidyr::separate(sample, into = c("group", "diet", "replicate"),
sep="[[:punct:]]", remove=F))
head(kcounts, 3)
## # A tibble: 3 x 9
## target_id length eff_length est_counts tpm sample group diet replicate
## <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr>
## 1 Y74C9A.6 163 36.6 1 1.19 schumack… schum… HB101 rep1
## 2 Y74C9A.3.1 892 725. 680 40.8 schumack… schum… HB101 rep1
## 3 Y74C9A.2a… 554 387. 0 0 schumack… schum… HB101 rep1
Step 2. PCA
Once again, it is clear that research group drives a lot of the variance in the data. Also clear that Stuhr’s OP50 replicate 1 should be excluded.
Step 3. Run Sleuth
First load metadata
dir <- "data/kallisto/h5"
files <- list.files(dir, full.names = TRUE)
# build metadata df
schumacker.meta <- kcounts %>%
dplyr::select(sample:replicate) %>%
unique() %>%
dplyr::mutate(path = files) %>%
dplyr::filter(group == "schumacker")
stuhr.meta <- kcounts %>%
dplyr::select(sample:replicate) %>%
unique() %>%
dplyr::mutate(path = files) %>%
dplyr::filter(group == "stuhr") %>%
dplyr::filter(!(diet == "OP50" & replicate == "rep1"))
head(stuhr.meta,3)
## # A tibble: 3 x 5
## sample group diet replicate path
## <chr> <chr> <chr> <chr> <chr>
## 1 stuhr_HB101_rep1 stuhr HB101 rep1 data/kallisto/h5/stuhr_HB101_rep1.h5
## 2 stuhr_HB101_rep2 stuhr HB101 rep2 data/kallisto/h5/stuhr_HB101_rep2.h5
## 3 stuhr_HB101_rep3 stuhr HB101 rep3 data/kallisto/h5/stuhr_HB101_rep3.h5
Associate transcripts to genes
Note: The sample quantifications performed by kallisto produce transcript abundance and count estimates. These can be parsed by R/sleuth, however sleuth does not “know” about genes. To perform gene-level analysis sleuth must parse a gene annotation.
# pulled this annotation file from the lab
t2g <- read.delim("data/WS276_t2g_all.tsv", stringsAsFactors = F) %>%
dplyr::filter(biotype %in% c("mRNA","pseudogenic_transcript")) %>%
dplyr::select(target_id = transcript, ens_gene, ext_gene)
head(t2g,4)
## target_id ens_gene ext_gene
## 1 Y74C9A.3.1 WBGene00022277 homt-1
## 2 Y74C9A.3.2 WBGene00022277 homt-1
## 3 Y74C9A.2a.1 WBGene00022276 nlp-40
## 4 Y74C9A.2a.2 WBGene00022276 nlp-40
Initialize sleuth object
The sleuth object contains specification of the experimental design, a map describing grouping of transcripts into genes (or other groups), and a number of user specific parameters.
# set up sleuth objects
schumacker.so <- sleuth_prep(schumacker.meta,
full_model = ~replicate+diet,
target_mapping = t2g,
read_bootstrap_tpm = TRUE,
extra_bootstrap_summary = TRUE,
transformation_function = function(x) log2(x + 0.5)) %>%
# By specifying log2(x + 0.5) I ensure the output fold changes are log2.
sleuth_fit()
stuhr.so <- sleuth_prep(stuhr.meta,
full_model = ~replicate+diet,
target_mapping = t2g,
read_bootstrap_tpm = TRUE,
extra_bootstrap_summary = TRUE,
transformation_function = function(x) log2(x + 0.5)) %>%
sleuth_fit()
Analysis
I want to examine what genes are differentially expressed as a result of diet, while accounting for the differences that are caused by differences in replicate. I will use the Wald test.
# use wald test to identify DEGs
schumacker.oe <- sleuth_wt(schumacker.so, which_beta = 'dietOP50')
stuhr.oe <- sleuth_wt(stuhr.so, which_beta = 'dietOP50')
Get gene-level differential expression results
# collect results data into a df
schumacker.sleuthtable <- sleuth_results(schumacker.oe, test = 'dietOP50', show_all = TRUE)
stuhr.sleuthtable <- sleuth_results(stuhr.oe, test = 'dietOP50', show_all = TRUE)
Visualize the results
# density of counts
plot_group_density(schumacker.oe, use_filtered = F)
plot_group_density(schumacker.oe, use_filtered = T)
# volcano plot
p1 <- plot_volcano(schumacker.oe, "dietOP50") +
labs(title = "Schumacker volcano plot") +
annotate(geom = "text", x = 5, y = 100,
label = length(which(schumacker.sleuthtable$qval <= 0.05 &
schumacker.sleuthtable$b >= 1))) +
annotate(geom = "text", x = -5, y = 100,
label = length(which(schumacker.sleuthtable$qval <= 0.05 &
schumacker.sleuthtable$b <= -1)))
p2 <- plot_volcano(stuhr.oe, "dietOP50") +
labs(title = "Stuhr volcano plot") +
annotate(geom = "text", x = 5, y = 100,
label = length(which(stuhr.sleuthtable$qval <= 0.05 &
stuhr.sleuthtable$b >= 1))) +
annotate(geom = "text", x = -5, y = 100,
label = length(which(stuhr.sleuthtable$qval <= 0.05 &
stuhr.sleuthtable$b <= -1)))
p1/p2
## this chunk is not run
# this opens up an interactive shiny app to better explore the data
sleuth_live(stuhr.oe)
sleuth_live(schumacker.oe)
Contingency table
Contingency table from Bowtie/GRanges workflow
contingency_bowtie
## Stuhr !Stuhr
## Schumacker 18 151
## !Schumacker 938 45797
Contingency table from Kallisto/Sleuth workflow
contingency_kallisto
## Stuhr !Stuhr
## Schumacker 14 508
## !Schumacker 589 58943
ftest <- stats::fisher.test(contingency_kallisto)
ftest
##
## Fisher's Exact Test for Count Data
##
## data: contingency_kallisto
## p-value = 0.0009521
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 1.487701 4.707659
## sample estimates:
## odds ratio
## 2.757804
log2(ftest$estimate)
## odds ratio
## 1.46352
I still see enrichment of up regulated genes, although less so compared to Bowtie/GRanges results.
Is there overlap between the upregulated genes found using Kallisto/Sleuth and those found using Bowtie/GRanges?
Wrap Up
Takeaways
General RNAseq analysis pipeline
Project questions
How does overall data from the two labs compare? Do I observe large “batch” effects?
- Data from the two labs are quite different which is very clear when assessing the PCA plots. This isn’t entirely surprising, however. While the Stuhr group collected synchronized L4 individuals, the Schumacker group used a mixed stage population for their study. As a result, there are inherent differences in developmental timing of the sampled individuals which can certainly drive differences in gene expression.
When using multiple alignment methods, do results vary? How so?
- I would say not so much. The necessary steps for downstream processing differ substantially but I think overall the data itself is comparable.
Do differentially expressed genes (DEGs) differ across the papers? How so? What genes are consistently picked out?
- They do differ! (Unsurprisingly of course given the differences in experimental workflow). But there are also genes (very few) that are found consistently between the two groups. I’ve shown these above.
Future directions
Dig into the data from a single group (ie. Schumacker using Kallisto). This could help the investigation of classes of genes that drive similar biological processes.
sessionInfo()
## R version 4.0.5 (2021-03-31)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Catalina 10.15.7
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] DT_0.20
## [2] sleuth_0.30.0
## [3] patchwork_1.1.1
## [4] DESeq2_1.30.1
## [5] TxDb.Celegans.UCSC.ce11.ensGene_3.12.0
## [6] GenomicFeatures_1.42.3
## [7] AnnotationDbi_1.52.0
## [8] ggthemes_4.2.4
## [9] forcats_0.5.1
## [10] stringr_1.4.0
## [11] dplyr_1.0.6
## [12] purrr_0.3.4
## [13] readr_1.4.0
## [14] tidyr_1.1.3
## [15] tibble_3.1.1
## [16] ggplot2_3.3.3
## [17] tidyverse_1.3.1
## [18] BSgenome.Celegans.UCSC.ce11_1.4.2
## [19] BSgenome_1.58.0
## [20] rtracklayer_1.50.0
## [21] GenomicAlignments_1.26.0
## [22] Rsamtools_2.6.0
## [23] Biostrings_2.58.0
## [24] XVector_0.30.0
## [25] SummarizedExperiment_1.20.0
## [26] Biobase_2.50.0
## [27] MatrixGenerics_1.2.1
## [28] matrixStats_0.59.0
## [29] GenomicRanges_1.42.0
## [30] GenomeInfoDb_1.26.7
## [31] IRanges_2.24.1
## [32] S4Vectors_0.28.1
## [33] BiocGenerics_0.36.1
##
## loaded via a namespace (and not attached):
## [1] colorspace_2.0-1 ellipsis_0.3.2 fs_1.5.0
## [4] rstudioapi_0.13 farver_2.1.0 bit64_4.0.5
## [7] fansi_0.4.2 lubridate_1.7.10 xml2_1.3.2
## [10] splines_4.0.5 cachem_1.0.5 geneplotter_1.68.0
## [13] knitr_1.33 jsonlite_1.7.2 broom_0.7.6
## [16] annotate_1.68.0 dbplyr_2.1.1 pheatmap_1.0.12
## [19] compiler_4.0.5 httr_1.4.2 backports_1.2.1
## [22] lazyeval_0.2.2 assertthat_0.2.1 Matrix_1.3-4
## [25] fastmap_1.1.0 cli_2.5.0 htmltools_0.5.1.1
## [28] prettyunits_1.1.1 tools_4.0.5 gtable_0.3.0
## [31] glue_1.4.2 GenomeInfoDbData_1.2.4 reshape2_1.4.4
## [34] rappdirs_0.3.3 Rcpp_1.0.7 cellranger_1.1.0
## [37] jquerylib_0.1.4 rhdf5filters_1.2.1 vctrs_0.3.8
## [40] crosstalk_1.1.1 xfun_0.22 rvest_1.0.0
## [43] lifecycle_1.0.0 XML_3.99-0.8 zlibbioc_1.36.0
## [46] scales_1.1.1 hms_1.0.0 rhdf5_2.34.0
## [49] RColorBrewer_1.1-2 yaml_2.2.1 curl_4.3.1
## [52] gridExtra_2.3 memoise_2.0.0 sass_0.4.0
## [55] biomaRt_2.46.3 stringi_1.5.3 RSQLite_2.2.8
## [58] highr_0.9 genefilter_1.72.1 BiocParallel_1.24.1
## [61] rlang_0.4.11 pkgconfig_2.0.3 bitops_1.0-7
## [64] evaluate_0.14 lattice_0.20-41 Rhdf5lib_1.12.1
## [67] labeling_0.4.2 htmlwidgets_1.5.3 bit_4.0.4
## [70] tidyselect_1.1.1 plyr_1.8.6 magrittr_2.0.1
## [73] bookdown_0.22 R6_2.5.0 generics_0.1.0
## [76] DelayedArray_0.16.3 DBI_1.1.1 pillar_1.6.0
## [79] haven_2.4.1 withr_2.4.2 survival_3.2-10
## [82] RCurl_1.98-1.3 modelr_0.1.8 crayon_1.4.1
## [85] utf8_1.2.1 BiocFileCache_1.14.0 rmarkdown_2.11
## [88] progress_1.2.2 locfit_1.5-9.4 grid_4.0.5
## [91] readxl_1.3.1 data.table_1.14.0 blob_1.2.1
## [94] rmdformats_1.0.3 reprex_2.0.0 digest_0.6.27
## [97] xtable_1.8-4 openssl_1.4.4 munsell_0.5.0
## [100] bslib_0.2.5.1 askpass_1.1