RNASeq report

  1. The unnormalized rollup data are placed in the directory <results>/summarization/results_no_Norm.
  2. The quantile-normalized data are placed in the directory <results>/summarization/results_quantileNorm.
  3. rpm (reads per million reads) values for genes are also available after rollup in the same directories. Simplistic rpm counts prior to rollup are available for the study in <results>/raw_counts/simplistic_sample_level_rpm.
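
As an illustration of the rpm values mentioned above, the following is a minimal sketch of a reads-per-million scaling for one sample. It assumes the denominator is the total of the counts supplied for that sample; the exact denominator used by the Lilly pipeline is not stated in this report. The function name and the example counts are hypothetical.

```python
def reads_per_million(counts):
    """Scale one sample's raw gene counts to reads per million (rpm).

    Assumption: the denominator is the sum of the counts passed in.
    The pipeline's own definition may differ (for example, mapped reads only).
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no reads")
    return {gene: count * 1_000_000 / total for gene, count in counts.items()}


# Hypothetical raw counts for a single sample (not taken from this study).
sample_counts = {"GENE_A": 150, "GENE_B": 350, "GENE_C": 500}
rpm = reads_per_million(sample_counts)
print(rpm)
```

Because each sample is scaled by its own total, rpm values are comparable across samples with different sequencing depths, but they are not quantile normalized.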

 

Samples and conditions

There are a total of 53 samples in this study. This count is the number of “sample.ids” in the study.

     

Investigation of the assay level data

Before investigating the gene counts generated by the Lilly RNASeq rollup pipeline, it may be useful to review the counts for each assay. Some studies have multiple assays per sample, while others have only a single assay per sample.

In this study, all assays for each sample have correlations above 0.80, so the assay-level gene count data can be combined for each sample using rollup.
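
The assay agreement check described above might look like the following sketch. It assumes a Pearson correlation computed on the per-gene counts of each pair of assays from the same sample; the report does not say which correlation measure or count scale the pipeline uses, and the function names and example counts are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length count vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / sqrt(var_x * var_y)


def assays_agree(assays, threshold=0.80):
    """True if every pair of assays for one sample correlates above threshold.

    A sample with a single assay has no pairs and therefore passes.
    """
    return all(
        pearson(assays[a], assays[b]) > threshold
        for a, b in combinations(assays, 2)
    )


# Hypothetical gene counts for two assays of the same sample.
assays = {
    "assay_1": [10, 200, 35, 0, 80],
    "assay_2": [12, 190, 40, 1, 75],
}
print(assays_agree(assays))
```

A sample whose assays fail this check would need review before its assays are rolled up together.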

 

Principal Component Analysis of the gene counts from each assay

 

Correlation between RNASeq metrics (such as total read count and 3’ bias) and the PCs from the assay-level gene counts

 

Investigation of the variance in the RNASeq gene counts from each assay. These are the raw gene counts from each assay; that is, this analysis is performed before the gene counts are put through the Lilly rollup pipeline, which aggregates the exon and junction counts from assays to genes per sample. This pre-rollup analysis helps determine whether any assays are problematic and need further attention on a case-by-case basis.
Each assay is evaluated as a potential outlier by PC analysis.
An outlier is an observation that is numerically distant from the rest of the data: more than 1.5 times the interquartile range above the upper quartile or below the lower quartile. The first two PCs are investigated for association with the various pre-calculated RNASeq metrics, which may help explain why certain assays appear to be outliers.

 

Investigation of the sample level data

From this point forward, the gene counts for the samples after the Lilly RNASeq rollup are investigated for outliers.

Expression distributions boxplot

     

Sample Outliers

The Lilly internal pipeline issues warnings on samples, which are reported in the design file. In addition, sample-level PC outliers are identified from the gene counts obtained after the Lilly rollup pipeline. This table provides a list of outliers.
#   Sample.id     Sample warning reason                                                    Additional Reason  sample.group_name
1   G039-5        greater than 30% mitochondrial content                                   -                  Pretreatment
2   G064-3        greater than 30% mitochondrial content                                   -                  During treatment of RAM monotherapy
3   G064-4        very biased base composition|greater than 80% mitochondrial content      PC outlier         During treatment of RAM monotherapy
4   G082-3        -                                                                        PC outlier         During treatment of PTX plus RAM
5   G082-4        -                                                                        PC outlier         During treatment of PTX plus RAM
6   G127-6        very biased base composition|greater than 80% mitochondrial content      PC outlier         During treatment of PTX plus RAM
7   G132-2        potential herpes virus infection|potential fungal contamination          PC outlier         PTX
8   G132-3        -                                                                        PC outlier         During treatment of RAM monotherapy
9   G132-6        greater than 30% mitochondrial content                                   -                  Post PTX RAM
10  NCCHE-G057-2  Patient is reported by investigator to be male but appears to be female  -                  Pretreatment
11  NCCHE-G057-3  Patient is reported by investigator to be male but appears to be female  -                  Post PTX RAM
12  NCCHE-G057-5  Patient is reported by investigator to be male but appears to be female  -                  During treatment of RAM monotherapy
13  NCCHE-G136    -                                                                        PC outlier         Pretreatment
14  NCCHE-G164-2  -                                                                        PC outlier         During treatment of PTX plus RAM

 

Principal component analysis of the quantile normalized rollup gene counts

     

Samples G082-3, G082-4, G132-2, G132-3, NCCHE-G136, G064-4, G127-6, NCCHE-G164-2 are PC outliers. Consider these carefully before proceeding with downstream analysis. In many studies, such samples should be dropped from the analysis.

An outlier is an observation that is numerically distant from the rest of the data: more than 1.5 times the interquartile range above the upper quartile or below the lower quartile.

     

Outliers marked in RNASeq metrics boxplots

After the internal Lilly pipeline, no correlation with RNASeq metrics is expected. Therefore, outliers for the various RNASeq metrics are calculated and presented where they overlap with the PC outliers. This may help explain why those samples are outliers.

Again, each plot allows the PC outlier samples to be examined against the various RNASeq metrics, which may explain why some of the samples are outliers.

     

Ethnicity

Ancestry-informative markers are used to infer the ethnicity of the samples.

     

Quantile normalized rollup counts (rlm) and simplistic sample-level rpm distributions

As a reminder, with the default parameters used in this run, the Lilly RNASeq rollup removes exons whose 80th percentile of summed raw read counts is < 10. The density plots show the gene count distributions of the Lilly RNASeq rollup counts and the simplistic sample-level rpm counts. They show clear evidence of the removal of low-expressed genes from the Lilly RNASeq rollup.

 

Genes with low expression (default parameters) are removed in the Lilly RNASeq rollup pipeline. In this study, 4072 genes with low expression were removed from the Lilly RNASeq rollup counts. The total number of genes in the simplistic sample-level rpm is 25885.
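
To make the default low-expression filter more concrete, here is a minimal sketch of a rule of the form "remove a feature when the 80th percentile of its read counts is below 10". It rests on two assumptions that this report does not confirm: that the percentile is taken across samples after the raw reads have been summed per sample, and that Python's "inclusive" percentile interpolation is acceptable. The function names and counts are hypothetical and do not reproduce the Lilly rollup code.

```python
from statistics import quantiles


def percentile_80(values):
    """80th percentile of a list of counts.

    quantiles(n=5) returns the 20th, 40th, 60th and 80th percentile cut points.
    """
    return quantiles(values, n=5, method="inclusive")[3]


def filter_low_expression(counts, min_count=10):
    """Keep features whose 80th percentile of counts is at least min_count.

    counts maps a feature (exon) name to its summed raw read counts,
    one value per sample (assumed layout).
    """
    return {
        feature: values
        for feature, values in counts.items()
        if percentile_80(values) >= min_count
    }


# Hypothetical summed raw read counts for two exons across ten samples.
counts = {
    "exon_1": [0, 2, 3, 1, 4, 0, 2, 5, 1, 3],
    "exon_2": [40, 55, 12, 80, 33, 61, 27, 90, 45, 50],
}
kept = filter_low_expression(counts)
print(sorted(kept))
```

A percentile-based rule of this kind removes a feature even when a minority of samples express it strongly, which is why group-specific genes can be lost under the default settings.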

Does running the filter eliminate some interesting genes? This table includes the median expression of certain genes that show differences in expression in specific sample groups. Please evaluate whether these genes should be included in the main analysis, as they are not present in the current Lilly RNASeq rollup output (default parameters). One option is to rerun the rollup with different settings.
In this study, the expression of 93 genes can be re-evaluated.

   

tSNE sample clustering

## Read the 52 x 50 data matrix successfully!
## Using no_dims = 2, perplexity = 2.000000, and theta = 0.500000
## Computing input similarities...
## Normalizing input...
## Building tree...
##  - point 0 of 52
## Done in 0.00 seconds (sparsity = 0.161982)!
## Learning embedding...
## Iteration 50: error is 79.812387 (50 iterations in 0.01 seconds)
## Iteration 100: error is 78.237445 (50 iterations in 0.01 seconds)
## Iteration 150: error is 68.266009 (50 iterations in 0.00 seconds)
## Iteration 200: error is 72.253209 (50 iterations in 0.01 seconds)
## Iteration 250: error is 71.060945 (50 iterations in 0.00 seconds)
## Iteration 300: error is 3.046715 (50 iterations in 0.01 seconds)
## Iteration 350: error is 2.382132 (50 iterations in 0.01 seconds)
## Iteration 400: error is 1.912596 (50 iterations in 0.00 seconds)
## Iteration 450: error is 2.211615 (50 iterations in 0.01 seconds)
## Iteration 500: error is 2.010348 (50 iterations in 0.01 seconds)
## Fitting performed in 0.07 seconds.
## [1] "perplexity = 2, max_iter = 500, learning rate = 200"

## Read the 52 x 50 data matrix successfully!
## Using no_dims = 2, perplexity = 5.000000, and theta = 0.500000
## Computing input similarities...
## Normalizing input...
## Building tree...
##  - point 0 of 52
## Done in 0.00 seconds (sparsity = 0.389793)!
## Learning embedding...
## Iteration 50: error is 66.111092 (50 iterations in 0.01 seconds)
## Iteration 100: error is 69.386717 (50 iterations in 0.00 seconds)
## Iteration 150: error is 65.174801 (50 iterations in 0.01 seconds)
## Iteration 200: error is 63.650356 (50 iterations in 0.01 seconds)
## Iteration 250: error is 69.414769 (50 iterations in 0.00 seconds)
## Iteration 300: error is 2.505615 (50 iterations in 0.01 seconds)
## Iteration 350: error is 2.130702 (50 iterations in 0.00 seconds)
## Iteration 400: error is 1.794287 (50 iterations in 0.01 seconds)
## Iteration 450: error is 1.830252 (50 iterations in 0.01 seconds)
## Iteration 500: error is 1.522245 (50 iterations in 0.00 seconds)
## Fitting performed in 0.06 seconds.
## [1] "perplexity = 5, max_iter = 500, learning rate = 200"

## Read the 52 x 50 data matrix successfully!
## Using no_dims = 2, perplexity = 10.000000, and theta = 0.500000
## Computing input similarities...
## Normalizing input...
## Building tree...
##  - point 0 of 52
## Done in 0.01 seconds (sparsity = 0.723373)!
## Learning embedding...
## Iteration 50: error is 59.716732 (50 iterations in 0.00 seconds)
## Iteration 100: error is 58.419821 (50 iterations in 0.01 seconds)
## Iteration 150: error is 57.084640 (50 iterations in 0.01 seconds)
## Iteration 200: error is 59.159695 (50 iterations in 0.00 seconds)
## Iteration 250: error is 62.227891 (50 iterations in 0.01 seconds)
## Iteration 300: error is 1.441920 (50 iterations in 0.00 seconds)
## Iteration 350: error is 0.942671 (50 iterations in 0.01 seconds)
## Iteration 400: error is 0.900135 (50 iterations in 0.01 seconds)
## Iteration 450: error is 0.643039 (50 iterations in 0.00 seconds)
## Iteration 500: error is 0.429907 (50 iterations in 0.01 seconds)
## Fitting performed in 0.06 seconds.
## [1] "perplexity = 10, max_iter = 500, learning rate = 200"


 

As a heuristic cutoff, tSNE was run on the 7000 most variable genes, using the median absolute deviation (MAD) as the measure of variability. Each of the 7000 genes also had a mean expression greater than 3 counts (~10 reads). Please vary the default tSNE settings and run numerous iterations before interpreting the clusters.
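
The gene selection described above can be sketched as follows: rank genes by MAD and keep the top N among those whose mean expression exceeds a minimum. The order of the two steps (mean filter first, then ranking) is an assumption, as are the function names and the toy expression values; the defaults of 7000 genes and a mean above 3 are taken from the text.

```python
from statistics import mean, median


def mad(values):
    """Median absolute deviation: median of |x - median(x)|."""
    center = median(values)
    return median(abs(v - center) for v in values)


def top_variable_genes(expr, n_top=7000, min_mean=3.0):
    """Return up to n_top gene names with the largest MAD.

    expr maps a gene name to its expression values, one per sample.
    Assumption: genes with mean expression <= min_mean are dropped first.
    """
    eligible = {g: v for g, v in expr.items() if mean(v) > min_mean}
    ranked = sorted(eligible, key=lambda g: mad(eligible[g]), reverse=True)
    return ranked[:n_top]


# Hypothetical expression values for four genes across four samples.
expr = {
    "GENE_A": [5.0, 9.0, 2.0, 12.0],
    "GENE_B": [6.0, 6.1, 5.9, 6.0],
    "GENE_C": [0.5, 4.0, 0.1, 0.2],
    "GENE_D": [4.0, 7.0, 5.0, 8.0],
}
print(top_variable_genes(expr, n_top=2))
```

MAD is less sensitive than the variance to a single extreme sample, which matters here because several samples in this study are flagged as outliers.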