Introduction

sleuth is a tool for the analysis and comparison of multiple related RNA-Seq experiments. Key features include:

To use sleuth, RNA-Seq data must first be quantified with kallisto, which is a program for very fast RNA-Seq quantification based on pseudo-alignment. An important feature of kallisto is that it outputs bootstraps along with the estimates of transcript abundances. These can serve as proxies for technical replicates, allowing for an ascertainment of the variability in estimates due to the random processes underlying RNA-Seq as well as the statistical procedure of read assignment. kallisto can quantify 30 million human reads in less than 3 minutes on a Mac desktop computer using only the read sequences and a transcriptome index that itself takes less than 10 minutes to build. sleuth has been designed to work seamlessly and efficiently with kallisto, and therefore RNA-Seq analysis with kallisto and sleuth is tractable on a laptop computer in a matter of minutes. More details about kallisto and sleuth are provided the papers describing the methods:

Nicolas L Bray, Harold Pimentel, Páll Melsted and Lior Pachter, Near-optimal probabilistic RNA-seq quantification, Nature Biotechnology 34, 525–527 (2016), doi:10.1038/nbt.3519

Harold Pimentel, Nicolas L Bray, Suzette Puente, Páll Melsted and Lior Pachter, Differential analysis of RNA-seq incorporating quantification uncertainty, in press.

sleuth has been designed to facilitate the exploration of RNA-Seq data by utilizing the Shiny web application framework by RStudio. The worked example below illustrates how to load data into sleuth and how to open Shiny plots for exploratory data analysis. The code underlying all plots is available via the Shiny interface so that analyses can be fully “open source”.

Follow the steps; copy paste the command on the R studio console and hit enter to run the code

Sleuth analysis steps

  1. set the working directory
base_dir <- "/home/kapeelc/sleuth_analysis"
  1. set the number of CPU’s to use for the analysis; set it according to your instance size
options(mc.cores = 4L)
  1. install ballgown R package along with its dependencies
source("http://bioconductor.org/biocLite.R")
biocLite("rhdf5")
install.packages("devtools", repos = "http://cran.us.r-project.org")
library("httr")
set_config(config(ssl_verifypeer = 0L))
devtools::install_github("pachterlab/sleuth")
library("sleuth")
  1. set the base directory where the kallisto output files are
base_dir <- "/home/kapeelc/sleuth_analysis"
  1. The first step in a sleuth analysis is to specify where the kallisto results are stored. A variable is created for this purpose with
sample_id <- dir(file.path(base_dir,"kallisto_qaunt_output"))
  1. The result can be displayed by typing
sample_id
## [1] "IS20351_DS_1_1" "IS20351_DS_2_1" "IS20351_DS_3_1" "IS20351_WW_1_1"
## [5] "IS20351_WW_2_1" "IS20351_WW_3_1"

In the box above, lines beginning with ## show the output of the command (in what follows we include the output that should appear with each command).

  1. A list of paths to the kallisto results indexed by the sample IDs is collated with
kal_dirs <- sapply(sample_id, function(id) file.path(base_dir, "kallisto_qaunt_output",id))
kal_dirs
##                                                       IS20351_DS_1_1 
## "/home/kapeelc/sleuth_analysis/kallisto_qaunt_output/IS20351_DS_1_1" 
##                                                       IS20351_DS_2_1 
## "/home/kapeelc/sleuth_analysis/kallisto_qaunt_output/IS20351_DS_2_1" 
##                                                       IS20351_DS_3_1 
## "/home/kapeelc/sleuth_analysis/kallisto_qaunt_output/IS20351_DS_3_1" 
##                                                       IS20351_WW_1_1 
## "/home/kapeelc/sleuth_analysis/kallisto_qaunt_output/IS20351_WW_1_1" 
##                                                       IS20351_WW_2_1 
## "/home/kapeelc/sleuth_analysis/kallisto_qaunt_output/IS20351_WW_2_1" 
##                                                       IS20351_WW_3_1 
## "/home/kapeelc/sleuth_analysis/kallisto_qaunt_output/IS20351_WW_3_1"
  1. The next step is to load an auxillary table that describes the experimental design and the relationship between the kallisto directories and the samples
s2c <- read.table(file.path(base_dir, "design_matrix"), header = TRUE, stringsAsFactors=FALSE)
s2c <- dplyr::select(s2c, sample = ID, condition, reps)
s2c
##           sample condition reps
## 1 IS20351_DS_1_1        DS    1
## 2 IS20351_DS_2_1        DS    2
## 3 IS20351_DS_3_1        DS    3
## 4 IS20351_WW_1_1        WW    1
## 5 IS20351_WW_2_1        WW    2
## 6 IS20351_WW_3_1        WW    3
  1. Now the directories must be appended in a new column to the table describing the experiment. This column must be labeled path, otherwise sleuth will report an error. This is to ensure that samples can be associated with kallisto quantifications.
s2c <- dplyr::mutate(s2c, path = kal_dirs)
print(s2c)
##           sample condition reps
## 1 IS20351_DS_1_1        DS    1
## 2 IS20351_DS_2_1        DS    2
## 3 IS20351_DS_3_1        DS    3
## 4 IS20351_WW_1_1        WW    1
## 5 IS20351_WW_2_1        WW    2
## 6 IS20351_WW_3_1        WW    3
##                                                                 path
## 1 /home/kapeelc/sleuth_analysis/kallisto_qaunt_output/IS20351_DS_1_1
## 2 /home/kapeelc/sleuth_analysis/kallisto_qaunt_output/IS20351_DS_2_1
## 3 /home/kapeelc/sleuth_analysis/kallisto_qaunt_output/IS20351_DS_3_1
## 4 /home/kapeelc/sleuth_analysis/kallisto_qaunt_output/IS20351_WW_1_1
## 5 /home/kapeelc/sleuth_analysis/kallisto_qaunt_output/IS20351_WW_2_1
## 6 /home/kapeelc/sleuth_analysis/kallisto_qaunt_output/IS20351_WW_3_1
  1. Include gene names to transcript level analysis. You can use Biomart function to get the transcipt to gene mapping or use custom gff parsing script. Here we already have the mapping file done, make sure the column names are as shown below
target_id ens_gene
Sb10g000101.1 Sb10g000101
Sb10g000200.1 Sb10g000200
Sb10g000210.1 Sb10g000210
Sb10g000220.1 Sb10g000220
Sb10g000230.1 Sb10g000230
t2g <- read.table("/home/kapeelc/sleuth_analysis/t2g.txt", header = TRUE, stringsAsFactors=FALSE)
  1. Next, the “sleuth object” can be constructed. This object will store not only the information about the experiment, but also details of the model to be used for differential testing, and the results. It is prepared and used with four commands that (1) load the kallisto processed data into the object (2) estimate parameters for the sleuth response error measurement (full) model (3) estimate parameters for the sleuth reduced model, and (4) perform differential analysis (testing) using the likelihood ratio test The sleuth object must first be initialized with
so <- sleuth_prep(s2c, ~ condition, target_mapping = t2g, extra_bootstrap_summary = TRUE)
## Warning in check_num_cores(num_cores): It appears that you are running Sleuth from within Rstudio.
## Because of concerns with forking processes from a GUI, 'num_cores' is being set to 1.
## If you wish to take advantage of multiple cores, please consider running sleuth from the command line.
## reading in kallisto results
## dropping unused factor levels
## ......
## 
## normalizing est_counts
## 22062 targets passed the filter
## normalizing tpm
## merging in metadata
## aggregating by column: ens_gene
## 20635 genes passed the filter
## summarizing bootstraps
## ......
## 
  1. Then the full model is fit with
so <- sleuth_fit(so)
## fitting measurement error models
## shrinkage estimation
## computing variance of betas
  1. What this has accomplished is to “smooth” the raw kallisto abundance estimates for each sample using a linear model with a parameter that represents the experimental condition (in this case DS vs. WW). To test for transcripts that are differential expressed between the conditions, sleuth performs a second fit to a “reduced” model that presumes abundances are equal in the two conditions. To identify differential expressed transcripts sleuth will then identify transcripts with a significantly better fit with the “full” model.

Sleuth likelihood ratio test(LRT) The “reduced” model is fit with

so <- sleuth_fit(so, ~1, 'reduced')
## fitting measurement error models
## shrinkage estimation
## computing variance of betas
  1. and the test is performed with
so <- sleuth_lrt(so, 'reduced', 'full')
  1. The models that have been fit can always be examined with the models() function.
models(so)
## [  full  ]
## formula:  ~condition 
## coefficients:
##  (Intercept)
##      conditionWW
## [  reduced  ]
## formula:  ~1 
## coefficients:
##  (Intercept)
  1. Results using Sleuth LRT test
sleuth_table <- sleuth_results(so, 'reduced:full', 'lrt', show_all = FALSE)
sleuth_significant <- dplyr::filter(sleuth_table, qval <= 0.05)
head(sleuth_significant, 20)
##      target_id test_stat         pval        qval       rss  sigma_sq
## 1  Sb02g029940  29.58808 5.343242e-08 0.001101830 39.422813 7.8576050
## 2  Sb09g029170  27.15144 1.881247e-07 0.001939660 10.557922 2.1108217
## 3  Sb01g027570  24.01893 9.539320e-07 0.003278505  5.618592 1.1224756
## 4  Sb01g049480  24.34469 8.054829e-07 0.003278505 12.923175 2.5750263
## 5  Sb02g029500  24.50740 7.402510e-07 0.003278505 10.706748 2.1375197
## 6  Sb06g030910  24.73321 6.583969e-07 0.003278505 11.669192 2.3314440
## 7  Sb06g029520  22.88469 1.720163e-06 0.004433935 17.065528 3.3890648
## 8  Sb10g030520  22.92296 1.686259e-06 0.004433935  6.138199 1.2239239
## 9  Sb03g032450  22.39862 2.215335e-06 0.005075825  7.215683 1.4382912
## 10 Sb01g017890  20.84643 4.976180e-06 0.005130690  8.474861 1.6939772
## 11 Sb02g000230  20.90889 4.816546e-06 0.005130690  7.698030 1.5370680
## 12 Sb02g043270  21.00442 4.582251e-06 0.005130690  4.388220 0.8739358
## 13 Sb03g030350  20.92360 4.779688e-06 0.005130690  8.481503 1.6917119
## 14 Sb04g009380  21.16492 4.214066e-06 0.005130690  5.816588 1.1621681
## 15 Sb04g022780  21.24101 4.050055e-06 0.005130690  7.173298 1.4294987
## 16 Sb04g023540  20.91532 4.800410e-06 0.005130690  3.504512 0.7001695
## 17 Sb05g023970  21.24897 4.033265e-06 0.005130690  4.871567 0.9706728
## 18 Sb06g031460  21.97864 2.757017e-06 0.005130690  7.486657 1.4968000
## 19 Sb06g032740  20.90139 4.835441e-06 0.005130690  5.646188 1.1290081
## 20 Sb09g022360  20.97467 4.653966e-06 0.005130690  4.714452 0.9409795
##        tech_var mean_obs   var_obs sigma_sq_pmax smooth_sigma_sq
## 1  0.0269575462 5.437268 7.8845625     7.8576050      0.10042321
## 2  0.0007626147 7.816658 2.1115843     2.1108217      0.09751482
## 3  0.0012428359 7.136449 1.1237184     1.1224756      0.08525765
## 4  0.0096086891 5.521946 2.5846350     2.5750263      0.09748964
## 5  0.0038299837 6.236925 2.1413497     2.1375197      0.08200889
## 6  0.0023944573 6.651215 2.3338385     2.3314440      0.08053236
## 7  0.0240408402 4.965810 3.4131056     3.3890648      0.12062126
## 8  0.0037157803 6.132523 1.2276397     1.2239239      0.08323863
## 9  0.0048453369 5.740736 1.4431366     1.4382912      0.09070462
## 10 0.0009949900 7.637174 1.6949722     1.6939772      0.09389996
## 11 0.0025378948 6.291518 1.5396059     1.5370680      0.08148414
## 12 0.0037082658 5.776490 0.8776440     0.8739358      0.08975347
## 13 0.0045887113 5.910535 1.6963006     1.6917119      0.08670741
## 14 0.0011494398 7.059961 1.1633176     1.1621681      0.08418323
## 15 0.0051609291 5.748936 1.4346596     1.4294987      0.09048182
## 16 0.0007329007 7.553536 0.7009024     0.7001695      0.09231031
## 17 0.0036406368 5.916088 0.9743134     0.9706728      0.08660073
## 18 0.0005313628 8.060806 1.4973313     1.4968000      0.10290909
## 19 0.0002294607 8.834977 1.1292376     1.1290081      0.12430690
## 20 0.0019108458 6.744899 0.9428903     0.9409795      0.08099911
##    final_sigma_sq degrees_free
## 1       7.8576050            1
## 2       2.1108217            1
## 3       1.1224756            1
## 4       2.5750263            1
## 5       2.1375197            1
## 6       2.3314440            1
## 7       3.3890648            1
## 8       1.2239239            1
## 9       1.4382912            1
## 10      1.6939772            1
## 11      1.5370680            1
## 12      0.8739358            1
## 13      1.6917119            1
## 14      1.1621681            1
## 15      1.4294987            1
## 16      0.7001695            1
## 17      0.9706728            1
## 18      1.4968000            1
## 19      1.1290081            1
## 20      0.9409795            1
  1. Sleuth Wald test: This function computes the Wald test on one specific ‘beta’ coefficient on every transcript; this offers beta values which are biased folds changes values for pairwise comparisons
so <- sleuth_wt(so,'conditionWW','full')
  1. Results using Sleuth Wald test
results_table <- sleuth_results(so, 'conditionWW', test_type = 'wt')
sleuth_significant <- dplyr::filter(results_table, qval <= 0.05)
head(sleuth_significant, 20)
##      target_id         pval         qval         b      se_b mean_obs
## 1  Sb02g029940 1.480105e-97 3.052125e-93  5.113227 0.2439366 5.437268
## 2  Sb06g030910 7.288914e-58 7.515235e-54 -2.767723 0.1726061 6.651215
## 3  Sb09g029170 1.659953e-54 1.140996e-50  2.646767 0.1702398 7.816658
## 4  Sb02g029500 6.363724e-48 3.280659e-44 -2.652629 0.1823849 6.236925
## 5  Sb01g049480 9.125599e-42 3.763579e-38  2.916635 0.2154148 5.521946
## 6  Sb06g029520 1.714308e-37 5.891791e-34 -3.340866 0.2610758 4.965810
## 7  Sb06g031460 6.880121e-37 2.026785e-33  2.206830 0.1739286 8.060806
## 8  Sb02g000230 5.338813e-31 1.125528e-27  2.232334 0.1928113 6.291518
## 9  Sb03g030350 4.500870e-31 1.125528e-27 -2.343266 0.2021373 5.910535
## 10 Sb04g009380 5.458164e-31 1.125528e-27  1.942344 0.1677918 7.059961
## 11 Sb01g017890 1.096846e-30 2.056188e-27  2.341897 0.2033618 7.637174
## 12 Sb01g027570 1.384164e-30 2.378571e-27 -1.930588 0.1679378 7.136449
## 13 Sb02g032815 3.040858e-29 4.823503e-26  1.907134 0.1698861 7.027473
## 14 Sb06g027520 3.836596e-28 5.651032e-25  2.868165 0.2607508 7.737337
## 15 Sb03g032450 7.029695e-28 9.663956e-25  2.177463 0.1989478 5.740736
## 16 Sb10g030520 1.173675e-27 1.512647e-24 -2.013254 0.1847302 6.132523
## 17 Sb04g022780 2.010690e-27 2.438967e-24  2.160773 0.1991633 5.748936
## 18 Sb05g001050 1.072051e-25 1.228153e-22  2.865335 0.2734204 5.363238
## 19 Sb06g032740 3.879851e-25 4.210863e-22  1.916693 0.1850582 8.834977
## 20 Sb09g022360 1.010259e-24 1.041627e-21 -1.752569 0.1707278 6.744899
##      var_obs     tech_var    sigma_sq smooth_sigma_sq final_sigma_sq
## 1  7.8845625 0.0269575462 0.024337209      0.06230006     0.06230006
## 2  2.3338385 0.0023944573 0.042294830      0.04226623     0.04229483
## 3  2.1115843 0.0007626147 0.011702703      0.04270977     0.04270977
## 4  2.1413497 0.0038299837 0.034191835      0.04606639     0.04606639
## 5  2.5846350 0.0096086891 0.031150422      0.05999660     0.05999660
## 6  3.4131056 0.0240408402 0.056821080      0.07820005     0.07820005
## 7  1.4973313 0.0005313628 0.044845374      0.04402345     0.04484537
## 8  1.5396059 0.0025378948 0.053226399      0.04539551     0.05322640
## 9  1.6963006 0.0045887113 0.056700543      0.05099744     0.05670054
## 10 1.1633176 0.0011494398 0.038234436      0.04108168     0.04108168
## 11 1.6949722 0.0009949900 0.061039015      0.04199324     0.06103902
## 12 1.1237184 0.0012428359 0.005717086      0.04106183     0.04106183
## 13 1.1257817 0.0012789022 0.042013051      0.04110683     0.04201305
## 14 2.5494999 0.0010455669 0.100940886      0.04236779     0.10094089
## 15 1.4431366 0.0048453369 0.021070718      0.05452499     0.05452499
## 16 1.2276397 0.0037157803 0.010887106      0.04747208     0.04747208
## 17 1.4346596 0.0051609291 0.037311354      0.05433813     0.05433813
## 18 2.5527531 0.0122589251 0.099879168      0.06436212     0.09987917
## 19 1.1292376 0.0002294607 0.033675849      0.05114035     0.05114035
## 20 0.9428903 0.0019108458 0.024889917      0.04181114     0.04181114

The easiest way to view and interact with the results is to generate the sleuth live site that allows for exploratory data analysis

sleuth_live(so)