Final Presentation

Joy Nyaanga

2021-12-02

Introduction

As bacterivores, C. elegans feed on bacteria that grow on rotting vegetation. Historically, a strain of Escherichia coli, OP50, was chosen as the standard laboratory diet. While most experimental studies use OP50 as the primary food source, the number of bacterial diets used to propagate C. elegans has expanded in recent years. Another laboratory E. coli strain that is routinely used is HB101. Although both of these bacterial diets support growth and development, the nutritional composition of each is unique, and when fed to C. elegans, differentially impacts several traits. Research has indicated that animals fed HB101 develop faster and are larger than those fed OP50. Unsurprisingly, these diet-induced growth effects are also accompanied by global changes in gene expression. Previously, researchers have analyzed the transcriptional response of animals fed differing diets (1,2).

Characterizing the expression changes of C. elegans fed different strains of bacteria expands the understanding of diet effects at the molecular level.

Schumacker et al., 2019. RNA sequencing dataset characterizing transcriptomic responses to dietary changes in Caenorhabditis elegans.
Stuhr & Curran, 2020. Bacterial diets differentially alter lifespan and health span trajectories in C. elegans

Schumacker workflow - Illumina NextSeq 500

Stuhr workflow - NovaSeq 600

Learning objectives

Gain general understanding of and experience using different alignment methods (Bowtie/Kallisto)
Handle processing of FastQ -> Bam -> count matrix
Identify DEGs using popular packages (DESeq2, sleuth)
Create pretty visualization

Project questions

How does overall data from the two labs compare? Do I observe large “batch” effects?
When using multiple alignment methods, do results vary?
Do differentially expressed genes (DEGs) differ across the papers? How so? What genes are consistently picked out?

Analysis using Bowtie & DESeq2

Step 1. Trimming, mapping, and indexing

Bowtie: FastQ -> Bam/Bai

The first step of this analysis was taking data generated from a sequencer (FastQ file format), perform trimming, mapping, and indexing to obtain a Bam and Bai file for downstream analysis.

Example bioproject entries from Schumacker 2019

Data from the papers archived through Sequence Read Archive (SRA) is accessible on Quest using sratoolkit. I was able to download all necessary fastq files and compress them on quest

Using a bash script, I was able to run Bowtie to generate Bam/Bai files for all conditions.

Step 2. Bam -> Granges

Given the Bam file produced by running Bowtie, I then used the GRanges package in R to convert this data into a GRanges object.

# load saved GRanges data
gr <- readRDS("data/GrangesList.RDS")
class(gr)

## [1] "CompressedGRangesList"
## attr(,"package")
## [1] "GenomicRanges"

names(gr)

##  [1] "schumacker_HB101_rep1" "schumacker_HB101_rep2" "schumacker_OP50_rep1" 
##  [4] "schumacker_OP50_rep2"  "stuhr_HB101_rep1"      "stuhr_HB101_rep2"     
##  [7] "stuhr_HB101_rep3"      "stuhr_OP50_rep1"       "stuhr_OP50_rep2"      
## [10] "stuhr_OP50_rep3"

Generate count matrix

# count over transcripts, grouped by gene
txbygn = transcriptsBy(TxDb.Celegans.UCSC.ce11.ensGene, by = 'gene')

counts.per.transcript = lapply(gr, function(x){
    out = countOverlaps(txbygn, x)
    return(out)})

counts.matrix = do.call('cbind', counts.per.transcript)
counts.matrix[1:4,1:2]

##                  schumacker_HB101_rep1 schumacker_HB101_rep2
## WBGene00000001.1                    87                    95
## WBGene00000002.1                    11                     6
## WBGene00000003.1                    18                    25
## WBGene00000004.1                     9                     6

Step 3. PCA

Perform PCA on data from both groups combined

Performed PCA on all genes using the stats::prcomp() function

OP50 replicate 1 for stuhr group is a HUGE outlier. Will remove this for further analysis

Looks like the difference between the two research groups explains a vast amount of the variation in the data (unsurprisingly). It is likely more appropriate to analyze the data from each group separately.

Split data into research groups and rerun PCA

Question – PC2 seems to describe the difference between replicates for both groups should I take PC2 into account for the DESeq model?

Well, is PC2 significant? Let’s consider the eigenvalues. The “Kaiser-Guttman” criterion for retaining principal components states that you only keep the ones whose eigenvalues are greater than the average eigenvalue.

## integer(0)

## integer(0)

Given this analysis, I can conclude that it is not necessary to account for PC2 when building DESeq models.

Step 4. Run DESeq2 on groups separately

schumacker.dds = DESeqDataSetFromMatrix(countData = counts.matrix[,1:4], 
                                        colData = schumacker.meta, 
                                        design = ~replicate + diet)

schumacker.dds = DESeq(schumacker.dds, parallel = FALSE)

schumacker.res = results(schumacker.dds, contrast = c("diet", "OP50", "HB101"),
                         independentFiltering = FALSE)

These plots look good. Let’s look at the associated volcano plots.

Question: do I see enrichment of upregulated genes between the data sets?

I can perform a “gene set” enrichment analysis using the Fisher’s Exact Test. I will have 4 categories: whether a gene is only up in Schumacker or Stuhr, whether a gene is up in both, and whether a gene is up in neither group.

#Does contingency table == size of full data set?
sum(contingency_bowtie) == nrow(schumacker.res)

## [1] TRUE

contingency_bowtie

##             Stuhr !Stuhr
## Schumacker     18    151
## !Schumacker   938  45797

I can now run a Fisher’s exact test

ftest <- stats::fisher.test(contingency_bowtie)
ftest$p.value

## [1] 1.364319e-08

log2(ftest$estimate)

## odds ratio 
##   2.540901

This odds ratio is greater than 1 indicating enrichment of “up genes” identified by Schumacker in the Stuhr data set.

## Warning: Expected 1 pieces. Additional pieces discarded in 18 rows [1, 2, 3, 4,
## 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18].

Analysis using Kallisto & Sleuth

Although Bowtie is a widely used, fast aligner, there are many other options available. The Andersen Lab uses Kallisto to map reads from RNAseq experiments. Kallisto is different from Bowtie as it gives transcript level expression information. While Bowtie maps reads to a reference genome, Kallisto quantifies transcript abundances using pseudo alignment of reads to a graph transcriptome. For more information about how Kallisto works, see this blog post.

Step 1. FastQ -> count matrix

I was able to use a pipeline developed by the Andersen Lab to perform mapping with Kallisto. Here, instead of generating a Bam/Bai file as output, a count matrix is returned.

pre trim QC using fastqc
trim using fastp (instead of TrimGalore)
generate transcriptome
kallisto mapping and indexing (instead of Bowtie)

# load kallisto data
dir <- "data/kallisto/tsv"
files <- list.files(dir, full.names = TRUE)

kcounts <- purrr::map_dfr(files, ~readr::read_tsv(.x) %>%
                            dplyr::mutate(sample = str_remove(basename(.x), ".tsv")) %>%
                            tidyr::separate(sample, into = c("group", "diet", "replicate"), 
                                            sep="[[:punct:]]", remove=F))

head(kcounts, 3)

## # A tibble: 3 x 9
##   target_id  length eff_length est_counts   tpm sample    group  diet  replicate
##   <chr>       <dbl>      <dbl>      <dbl> <dbl> <chr>     <chr>  <chr> <chr>    
## 1 Y74C9A.6      163       36.6          1  1.19 schumack… schum… HB101 rep1     
## 2 Y74C9A.3.1    892      725.         680 40.8  schumack… schum… HB101 rep1     
## 3 Y74C9A.2a…    554      387.           0  0    schumack… schum… HB101 rep1

Step 2. PCA

Once again, it is clear that research group drives a lot of the variance in the data. Also clear that Stuhr’s OP50 replicate 1 should be excluded.

Step 3. Run Sleuth

First load metadata

dir <- "data/kallisto/h5"
files <- list.files(dir, full.names = TRUE)

# build metadata df
schumacker.meta <- kcounts %>%
  dplyr::select(sample:replicate) %>%
  unique() %>%
  dplyr::mutate(path = files) %>%
  dplyr::filter(group == "schumacker")

stuhr.meta <- kcounts %>%
  dplyr::select(sample:replicate) %>%
  unique() %>%
  dplyr::mutate(path = files) %>%
  dplyr::filter(group == "stuhr") %>%
  dplyr::filter(!(diet == "OP50" & replicate == "rep1"))

head(stuhr.meta,3)

## # A tibble: 3 x 5
##   sample           group diet  replicate path                                
##   <chr>            <chr> <chr> <chr>     <chr>                               
## 1 stuhr_HB101_rep1 stuhr HB101 rep1      data/kallisto/h5/stuhr_HB101_rep1.h5
## 2 stuhr_HB101_rep2 stuhr HB101 rep2      data/kallisto/h5/stuhr_HB101_rep2.h5
## 3 stuhr_HB101_rep3 stuhr HB101 rep3      data/kallisto/h5/stuhr_HB101_rep3.h5

Associate transcripts to genes

Note: The sample quantifications performed by kallisto produce transcript abundance and count estimates. These can be parsed by R/sleuth, however sleuth does not “know” about genes. To perform gene-level analysis sleuth must parse a gene annotation.

# pulled this annotation file from the lab
t2g <- read.delim("data/WS276_t2g_all.tsv", stringsAsFactors = F) %>%
  dplyr::filter(biotype %in% c("mRNA","pseudogenic_transcript")) %>%
  dplyr::select(target_id = transcript, ens_gene, ext_gene)
head(t2g,4)

##     target_id       ens_gene ext_gene
## 1  Y74C9A.3.1 WBGene00022277   homt-1
## 2  Y74C9A.3.2 WBGene00022277   homt-1
## 3 Y74C9A.2a.1 WBGene00022276   nlp-40
## 4 Y74C9A.2a.2 WBGene00022276   nlp-40

Initialize sleuth object

The sleuth object contains specification of the experimental design, a map describing grouping of transcripts into genes (or other groups), and a number of user specific parameters.

# set up sleuth objects
schumacker.so <- sleuth_prep(schumacker.meta,
                             full_model = ~replicate+diet,
                             target_mapping = t2g,
                             read_bootstrap_tpm = TRUE,
                             extra_bootstrap_summary = TRUE,
                             transformation_function = function(x) log2(x + 0.5)) %>%
  # By specifying log2(x + 0.5) I ensure the output fold changes are log2.
  sleuth_fit()

stuhr.so <- sleuth_prep(stuhr.meta, 
                        full_model = ~replicate+diet,
                        target_mapping = t2g,
                        read_bootstrap_tpm = TRUE,
                        extra_bootstrap_summary = TRUE,
                        transformation_function = function(x) log2(x + 0.5)) %>%
  sleuth_fit()

Analysis

I want to examine what genes are differentially expressed as a result of diet, while accounting for the differences that are caused by differences in replicate. I will use the Wald test.

# use wald test to identify DEGs
schumacker.oe <- sleuth_wt(schumacker.so, which_beta = 'dietOP50')
stuhr.oe <- sleuth_wt(stuhr.so, which_beta = 'dietOP50')

Get gene-level differential expression results

# collect results data into a df
schumacker.sleuthtable <- sleuth_results(schumacker.oe, test = 'dietOP50', show_all = TRUE)

stuhr.sleuthtable <- sleuth_results(stuhr.oe, test = 'dietOP50', show_all = TRUE)

Visualize the results

# density of counts
plot_group_density(schumacker.oe, use_filtered = F)

plot_group_density(schumacker.oe, use_filtered = T)

# volcano plot
p1 <- plot_volcano(schumacker.oe, "dietOP50") + 
  labs(title = "Schumacker volcano plot") +
  annotate(geom = "text", x = 5, y = 100, 
           label = length(which(schumacker.sleuthtable$qval <= 0.05 & 
                                  schumacker.sleuthtable$b >= 1))) +
  annotate(geom = "text", x = -5, y = 100, 
           label = length(which(schumacker.sleuthtable$qval <= 0.05 & 
                                  schumacker.sleuthtable$b <= -1))) 

p2 <- plot_volcano(stuhr.oe, "dietOP50") + 
  labs(title = "Stuhr volcano plot") +
  annotate(geom = "text", x = 5, y = 100, 
           label = length(which(stuhr.sleuthtable$qval <= 0.05 & 
                                  stuhr.sleuthtable$b >= 1))) +
  annotate(geom = "text", x = -5, y = 100, 
           label = length(which(stuhr.sleuthtable$qval <= 0.05 & 
                                  stuhr.sleuthtable$b <= -1))) 
p1/p2

## this chunk is not run
# this opens up an interactive shiny app to better explore the data
sleuth_live(stuhr.oe)
sleuth_live(schumacker.oe)

Contingency table

Contingency table from Bowtie/GRanges workflow

contingency_bowtie

##             Stuhr !Stuhr
## Schumacker     18    151
## !Schumacker   938  45797

Contingency table from Kallisto/Sleuth workflow

contingency_kallisto

##             Stuhr !Stuhr
## Schumacker     14    508
## !Schumacker   589  58943

ftest <- stats::fisher.test(contingency_kallisto)
ftest

## 
##  Fisher's Exact Test for Count Data
## 
## data:  contingency_kallisto
## p-value = 0.0009521
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  1.487701 4.707659
## sample estimates:
## odds ratio 
##   2.757804

log2(ftest$estimate)

## odds ratio 
##    1.46352

I still see enrichment of up regulated genes, although less so compared to Bowtie/GRanges results.

Is there overlap between the upregulated genes found using Kallisto/Sleuth and those found using Bowtie/GRanges?

Wrap Up

Takeaways

General RNAseq analysis pipeline

Project questions

How does overall data from the two labs compare? Do I observe large “batch” effects?
- Data from the two labs are quite different which is very clear when assessing the PCA plots. This isn’t entirely surprising, however. While the Stuhr group collected synchronized L4 individuals, the Schumacker group used a mixed stage population for their study. As a result, there are inherent differences in developmental timing of the sampled individuals which can certainly drive differences in gene expression.
When using multiple alignment methods, do results vary? How so?
- I would say not so much. The necessary steps for downstream processing differ substantially but I think overall the data itself is comparable.
Do differentially expressed genes (DEGs) differ across the papers? How so? What genes are consistently picked out?
- They do differ! (Unsurprisingly of course given the differences in experimental workflow). But there are also genes (very few) that are found consistently between the two groups. I’ve shown these above.

Future directions

Dig into the data from a single group (ie. Schumacker using Kallisto). This could help the investigation of classes of genes that drive similar biological processes.

sessionInfo()

## R version 4.0.5 (2021-03-31)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Catalina 10.15.7
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] DT_0.20                               
##  [2] sleuth_0.30.0                         
##  [3] patchwork_1.1.1                       
##  [4] DESeq2_1.30.1                         
##  [5] TxDb.Celegans.UCSC.ce11.ensGene_3.12.0
##  [6] GenomicFeatures_1.42.3                
##  [7] AnnotationDbi_1.52.0                  
##  [8] ggthemes_4.2.4                        
##  [9] forcats_0.5.1                         
## [10] stringr_1.4.0                         
## [11] dplyr_1.0.6                           
## [12] purrr_0.3.4                           
## [13] readr_1.4.0                           
## [14] tidyr_1.1.3                           
## [15] tibble_3.1.1                          
## [16] ggplot2_3.3.3                         
## [17] tidyverse_1.3.1                       
## [18] BSgenome.Celegans.UCSC.ce11_1.4.2     
## [19] BSgenome_1.58.0                       
## [20] rtracklayer_1.50.0                    
## [21] GenomicAlignments_1.26.0              
## [22] Rsamtools_2.6.0                       
## [23] Biostrings_2.58.0                     
## [24] XVector_0.30.0                        
## [25] SummarizedExperiment_1.20.0           
## [26] Biobase_2.50.0                        
## [27] MatrixGenerics_1.2.1                  
## [28] matrixStats_0.59.0                    
## [29] GenomicRanges_1.42.0                  
## [30] GenomeInfoDb_1.26.7                   
## [31] IRanges_2.24.1                        
## [32] S4Vectors_0.28.1                      
## [33] BiocGenerics_0.36.1                   
## 
## loaded via a namespace (and not attached):
##   [1] colorspace_2.0-1       ellipsis_0.3.2         fs_1.5.0              
##   [4] rstudioapi_0.13        farver_2.1.0           bit64_4.0.5           
##   [7] fansi_0.4.2            lubridate_1.7.10       xml2_1.3.2            
##  [10] splines_4.0.5          cachem_1.0.5           geneplotter_1.68.0    
##  [13] knitr_1.33             jsonlite_1.7.2         broom_0.7.6           
##  [16] annotate_1.68.0        dbplyr_2.1.1           pheatmap_1.0.12       
##  [19] compiler_4.0.5         httr_1.4.2             backports_1.2.1       
##  [22] lazyeval_0.2.2         assertthat_0.2.1       Matrix_1.3-4          
##  [25] fastmap_1.1.0          cli_2.5.0              htmltools_0.5.1.1     
##  [28] prettyunits_1.1.1      tools_4.0.5            gtable_0.3.0          
##  [31] glue_1.4.2             GenomeInfoDbData_1.2.4 reshape2_1.4.4        
##  [34] rappdirs_0.3.3         Rcpp_1.0.7             cellranger_1.1.0      
##  [37] jquerylib_0.1.4        rhdf5filters_1.2.1     vctrs_0.3.8           
##  [40] crosstalk_1.1.1        xfun_0.22              rvest_1.0.0           
##  [43] lifecycle_1.0.0        XML_3.99-0.8           zlibbioc_1.36.0       
##  [46] scales_1.1.1           hms_1.0.0              rhdf5_2.34.0          
##  [49] RColorBrewer_1.1-2     yaml_2.2.1             curl_4.3.1            
##  [52] gridExtra_2.3          memoise_2.0.0          sass_0.4.0            
##  [55] biomaRt_2.46.3         stringi_1.5.3          RSQLite_2.2.8         
##  [58] highr_0.9              genefilter_1.72.1      BiocParallel_1.24.1   
##  [61] rlang_0.4.11           pkgconfig_2.0.3        bitops_1.0-7          
##  [64] evaluate_0.14          lattice_0.20-41        Rhdf5lib_1.12.1       
##  [67] labeling_0.4.2         htmlwidgets_1.5.3      bit_4.0.4             
##  [70] tidyselect_1.1.1       plyr_1.8.6             magrittr_2.0.1        
##  [73] bookdown_0.22          R6_2.5.0               generics_0.1.0        
##  [76] DelayedArray_0.16.3    DBI_1.1.1              pillar_1.6.0          
##  [79] haven_2.4.1            withr_2.4.2            survival_3.2-10       
##  [82] RCurl_1.98-1.3         modelr_0.1.8           crayon_1.4.1          
##  [85] utf8_1.2.1             BiocFileCache_1.14.0   rmarkdown_2.11        
##  [88] progress_1.2.2         locfit_1.5-9.4         grid_4.0.5            
##  [91] readxl_1.3.1           data.table_1.14.0      blob_1.2.1            
##  [94] rmdformats_1.0.3       reprex_2.0.0           digest_0.6.27         
##  [97] xtable_1.8-4           openssl_1.4.4          munsell_0.5.0         
## [100] bslib_0.2.5.1          askpass_1.1